Exhibit Q; copies of issued U.S. Patents not provided pursuant to requests from the USPTO), none 
of which contain examples of the "real-world" utilities that the Examiner seems to be requiring. As 
issued U.S. Patents are presumed to meet all of the requirements for patentability, including 
35 U.S.C. §§ 101 and 112, first paragraph (see Section VI, below), Applicants submit that the present 
polynucleotides must also meet the requirements of 35 U.S.C. § 101. While Applicants understand that 
each application is examined on its own merits, Applicants are unaware of any changes to 
35 U.S.C. § 101, or in the interpretation of 35 U.S.C. § 101 by the Supreme Court or the Federal 
Circuit, since the issuance of these patents that render the subject matter claimed in these patents, which 
is similar to the subject matter in question in the present application, as suddenly non-statutory or failing 
to meet the requirements of 35 U.S.C. § 101. Thus, holding Applicants to a different standard of utility 
would be arbitrary and capricious, and, like other clear violations of due process, cannot stand. 

For each of the foregoing reasons, Applicants submit that as the presently claimed nucleic acid 
molecules have been shown to have a substantial, specific, credible and well-established utility, the 
rejection of claims 1 and 3-20 under 35 U.S.C. § 101 has been overcome, and request that the 
rejection be withdrawn. 

VI. Rejection of Claims 1 and 3-20 Under 35 U.S.C. § 112, First Paragraph 

The Action next rejects claims 1 and 3-20 under 35 U.S.C. § 112, first paragraph, since 
allegedly one skilled in the art would not know how to use the invention, as the invention allegedly is 
not supported by a specific, substantial, and credible utility or a well-established utility. Applicants 
respectfully traverse. 

Applicants submit that as claims 1 and 3-20 have been shown to have "a specific, substantial, 
and credible utility", as detailed in Section V above, the present rejection of claims 1 and 3-20 under 
35 U.S.C. § 112, first paragraph, cannot stand. 

Applicants therefore request that the rejection of claims 1 and 3-20 under 35 U.S.C. § 112, 
first paragraph, be withdrawn. 

VII. Conclusion 

The present document is a full and complete response to the Action. In conclusion, Applicants 
submit that, in light of the foregoing remarks, the present case is in condition for allowance, and such 
favorable action is respectfully requested. Should Examiner Murphy have any questions or comments, 
or believe that certain amendments of the claims might serve to improve their clarity, a telephone call 
to the undersigned Applicants' representative is earnestly solicited. 

Respectfully submitted, 

January 11, 2005 
Date                David W. Hibler, Reg. No. 41,071 

Agent for Applicants 

LEXICON GENETICS INCORPORATED 
(281) 863-3399 

Customer # 24231 






EXHIBIT A 



JAN 13 2005 

>NM_153834 ACCESSION: NM_153834 NID: gi 24475864 ref NM_153834.1 Homo 
sapiens G protein-coupled receptor 112 (GPR112), mRNA 
Length = 8400 

Identities = 972/1199 (81%), Positives = 1/1199 (0%), Gaps = 222/1199 (19%) 
Frame = +1 

Query: 1 MTSSNTQPLLMTSWNIPTAEGSQFPISTTINVPTSNEMETETLHLVPGPLSTFTASQTGL 60 

MTSSNTQPLLMTSWNIPTAEGSQFPISTTINVPTSNEMETETLHLVPGPLSTFTASQTGL 
Sbjct: 5416 MTSSNTQPLLMTSWNIPTAEGSQFPISTTINVPTSNEMETETLHLVPGPLSTFTASQTGL 5595 

Query 61 VSKDVMAMSSIPMSGILPNHGLSENPSLSTSLRAITSTLADVKHTFEKMTTSVTPGTTLP 120 

VSKDVMAMSSIPMSGILPNHGLSENPSLSTSLRAITSTLADVKHTFEKMTTSVTPGTTLP 
Sbjct: 5596 VSKDVMAMSSIPMSGILPNHGLSENPSLSTSLRAITSTLADVKHTFEKMTTSVTPGTTLP 5775 

Query 121 SILSGATSGSVISKSPILTWLLSSLPSGSPPATVSNAPHVMTSSTVEVSKSTFLTSDMIS 180 

SILSGATSGSVISKSPILTWLLSSLPSGSPPATVSNAPHVMTSSTVEVSKSTFLTSDMIS 
Sbjct: 5776 SILSGATSGSVISKSPILTWLLSSLPSGSPPATVSNAPHVMTSSTVEVSKSTFLTSDMIS 5955 

Query: 181 AHPFTNLTTLPSATMSTILTRTIPTPTLGGITTGFPTSLPMSINVTDDIVYISTHPEASS 240 

AHPFTNLTTLPSATMSTILTRTIPTPTLGGITTGFPTSLPMSINVTDDIVYISTHPEASS 
Sbjct: 5956 AHPFTNLTTLPSATMSTILTRTIPTPTLGGITTGFPTSLPMSINVTDDIVYISTHPEASS 6135 

Query 241 RTTITANPRTVSHPSSFSRKTMSPSTTDHTLSVGAMPLPSSTITSSWNRIPTASSPSTLI 300 

RTTITANPRTVSHPSSFSRKTMSPSTTDHTLSVGAMPLPSSTITSSWNRIPTASSPSTLI 
Sbjct: 6136 RTTITANPRTVSHPSSFSRKTMSPSTTDHTLSVGAMPLPSSTITSSWNRIPTASSPSTLI 6315 

Query: 301 IPKPTLDSLLNIMTTTSTVPGASFPLISTGVTYPFTATVSSPISSFFETTWLDSTPSFLS 360 

IPKPTLDSLLNIMTTTSTVPGASFPLISTGVTYPFTATVSSPISSFFETTWLDSTPSFLS 
Sbjct: 6316 IPKPTLDSLLNIMTTTSTVPGASFPLISTGVTYPFTATVSSPISSFFETTWLDSTPSFLS 6495 

Query: 361 TEASTSPTATKSTVSFYNVEMSFSVFVEEPRIPITSVINEFTENSLNSIFQNSEFSLATL 420 

TEASTSPTATKST 

Sbjct: 6496 TEASTSPTATKST - 6534 

Query: 421 ETQIKSRDISEEEMVMDRAILEQREGQEMATISYVPYSCVCQVIIKASSSLASSELMRKI 480 

Sbjct: 


Query 481 KSKIHGNFTHGNFTQDQLTLLVNCEHVAVKKLEPGNCKADETASKYKGTYKWLLTNPTET 540 

EPGNCKADETASKYKGTYKWLLTNPTET 
Sbjct: EPGNCKADETASKYKGTYKWLLTNPTET 6618 

Query: 541 AQTRCIKNEDGNATRFCSISINTGKSQWEKPKFKQCKLLQELPDKIVDLANITISDENPE 600 

AQTRCIKNEDGNATRF SISINTGKSQWEKPKFKQCKLLQELPDKIVDLANITISDEN E 
Sbjct: 6619 AQTRCIKNEDGNATRF-SISINTGKSQWEKPKFKQCKLLQELPDKIVDLANITISDENAE 6795 

Query: 601 DVAEHILNLINESPALGKEETKIIVSKISDISQCDEISMNLTHVMLQIINVVLEKQNNSA 660 

DVAEHILNLINESPALGKEETKIIVSKISDISQCDEISMNLTHVMLQIINVVLEKQNNSA 
Sbjct: 6796 DVAEHILNLINESPALGKEETKIIVSKISDISQCDEISMNLTHVMLQIINVVLEKQNNSA 6975 

Query: 661 SDLHEISNEILRIIERPGHKMEFSGQIANLAVAGLALAVLRGDHTFDGMAFSIHSYEEGP 720 

SDLHEISNEILRIIER GHKMEFSGQIANL VAGLALAVLRGDHTFDGMAFSIHSYEEG 
Sbjct: 6976 SDLHEISNEILRIIERTGHKMEFSGQIANLTVAGLALAVLRGDHTFDGMAFSIHSYEEGT 7155 



Query: 721 DPDIFLGNVPVGGILASIYLPKSLTERIPLSNLQPILFNFFGQTSLFKTKNVTKALTTYV 780 

DP+                                             TKNVTKALTTYV 
Sbjct: 7156 DPE---------------------------------------------TKNVTKALTTYV 7200 

Query- 781 VSASISD-MFIQNLADPWITLQHIGGNQNYGQVHCAFWDFEl^GLGGWNSSGCKVKET 839 

VSASISD MF I QNLADPWITLQH IGGNQNYGQVHCAFWDFENN GLGGWNSSGCKVKET 
Sbjct: 7201 VSASISDDMFIQNLADPWITLQHIGGNQNYGQVHCAFWDFENN-GLGGWNSSGCKVKET 7377 

Query: 840 NVNYTICQCDHLTHFGVLMDLSRSTVDSVNEQILALITYTGCGISSIFLGVAVVTYIAFH 899 

NVNYTICQCDHLTHFGVLMDLSRSTVDSVNEQILALITYTGCGISSIFLGVAVVTYIAF 
Sbjct: 7378 NVNYTICQCDHLTHFGVLMDLSRSTVDSVNEQILALITYTGCGISSIFLGVAVVTYIAF- 7554 

Query: 900 KLRKDYPAKILINLCTALLMLNLVFLINSWLSSFQKVGVCITAAVALHYFLLVSFTWMGL 959 

KLRKDYPAKILINLCTALLMLNLVFLINSWLSSFQKVGVCITAAVALHYFLLVSFTWMGL 
Sbjct: 7555 KLRKDYPAKILINLCTALLMLNLVFLINSWLSSFQKVGVCITAAVALHYFLLVSFTWMGL 7734 

Query: 960 EAVHMYLALVKVFNIYIPNYILKFCLVGWGIPAIMVAITVSVKKDLYGTLSPTTPFCWIK 1019 

EAVHMYLALVKVFNIYIPNYILKFCLVGWGIPAIMVAITVSVKKDLYGTLSPTTP CWIK 
Sbjct: 7735 EAVHMYLALVKVFNIYIPNYILKFCLVGWGIPAIMVAITVSVKKDLYGTLSPTTP-CWIK 7911 

Query: 1020 DDSIFYISVVAYFCLIFLMNLSMFCTVLVQLNSVKSQIQKTRRKMILHDLKGTMSLTFLL 1079 

DDSIFYISVVAYFCLIFLMNLSMFCTVLVQLNSVKSQIQKTRRKMILHDLKGTMSLTFLL 
Sbjct: 7912 DDSIFYISVVAYFCLIFLMNLSMFCTVLVQLNSVKSQIQKTRRKMILHDLKGTMSLTFLL 8091 

Query: 1080 GLTWGFAFFAWGPMRNFFLYLFAIFNTLQGFFIFVFHCVMKESVREQWQIHLCCGWLRLD 1139 

GLTWGFAFFAWGPMRNFFLYLFAIFNTLQ 
Sbjct: 8092 GLTWGFAFFAWGPMRNFFLYLFAIFNTLQ 8178 

Query: 1140 NSSDGSSRCQIKVGYKQEGLKKIFEHKLLTPSLKSTATSSTFKSLGSAQGTPSEISFPN 1198 

DGSSRCQIKVGYKQEGLKKIFEHKLLTPSLKSTATSSTFKSLGSAQGTPSEISFPN 
Sbjct: 8179 ---DGSSRCQIKVGYKQEGLKKIFEHKLLTPSLKSTATSSTFKSLGSAQGTPSEISFPN 8346 
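
As a reading aid only, the percentages shown in the alignment summary of this exhibit (Identities, Positives, Gaps) can be reproduced from the raw counts printed beside them. The short Python sketch below is illustrative and is not part of the alignment program's output; it assumes the program rounds each ratio to the nearest whole percent, and the helper name `percent` is hypothetical.

```python
def percent(count: int, total: int) -> int:
    """Return count/total as a whole-number percentage (assumed rounding rule)."""
    return round(100 * count / total)


# Raw counts copied from the alignment summary line of the exhibit.
header = {
    "Identities": (972, 1199),
    "Positives": (1, 1199),
    "Gaps": (222, 1199),
}

# Recompute each percentage from its count and the alignment length.
report = {name: percent(count, total) for name, (count, total) in header.items()}

for name, value in report.items():
    count, total = header[name]
    print(f"{name} = {count}/{total} ({value}%)")
```

Under that rounding assumption, the recomputed values agree with the 81%, 0%, and 19% figures reported in the summary line.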
LOCUS       NM_153834               8400 bp    mRNA    linear   PRI 27-OCT-2004
DEFINITION  Homo sapiens G protein-coupled receptor 112 (GPR112), mRNA.
ACCESSION   NM_153834
VERSION     NM_153834.1  GI:24475864
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 8400)
  AUTHORS   Fredriksson,R., Lagerstrom,M.C., Hoglund,P.J. and Schioth,H.B.
  TITLE     Novel human G protein-coupled receptors with long N-terminals
            containing GPS domains and Ser/Thr-rich regions
  JOURNAL   FEBS Lett. 531 (3), 407-414 (2002)
   PUBMED   12435584
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from AY140954.1.
FEATURES             Location/Qualifiers
     source          1..8400
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /db_xref="taxon:9606"
                     /chromosome="X"
                     /map="Xq26.3"
     gene            1..8400
                     /gene="GPR112"
                     /note="synonyms: PGR17, RP1-299I16"
                     /db_xref="GeneID:139378"
                     /db_xref="LocusID:139378"
     CDS             1..8400
                     /gene="GPR112"
                     /note="go_component: integral to membrane [goid 0016021]
                     [evidence IEA];
                     go_function: G-protein coupled receptor activity [goid
                     0004930] [evidence IEA];
                     go_process: neuropeptide signaling pathway [goid 0007218]
                     [evidence IEA]"
                     /codon_start=1
                     /product="G-protein coupled receptor 112"
                     /protein_id="NP_722576.1"
                     /db_xref="GI:24475865"
                     /db_xref="GeneID:139378"
                     /db_xref="LocusID:139378"

                     /translation="MDDNSRYWMAFSYITNNALLGREDIDLGLAGDHQQLILYRLGKT
                     FSIRHHLASFQWHTICLIWDGVKGKLELFLNKERILEVTDQPHNLTPHGTLFLGHFLK
                     NESSEVKSMMRSFPGSLYYFQLWDHILENEEFMKCLDGNIVSWEEDVWLVNKIIPTVD
                     RTLRCVPENMTIQEKSTTVSQQIDMTTPSQITGVKPQNTAHSSTLLSQSIPIFATDYT
                     TISYSNTTSPPLETMTAQKILKTLVDETATFAVDVLSTSSAISLPTQSISIDNTTNSM
                     KKTKSPSSESTKTTKMVEAMATEIFQPPTPSNFLSTSRFTKNSVVSTTSAIKSQSAVT
                     KTTSLFSTIESTSMSTTPCLKQKSTNTGALPISTAGQEFIESTAAGTVPWFTVEKTSP
                     ASTHVGTASSFPPEPVLISTAAPVDSVFPRNQTAFPLATTDMKIAFTVHSLTLPTRLI
                     ETTPAPRTAETELTSTNFQDVSLPRVEDAMSTSMSKETSSKTFSFLTSFSFTGTESVQ
                     TVIDAEATRTALTPEITLASTVAETMLSSTITGRVYTQNTPTADGHLLTLMSTRSAST
                     SKAPESGPTSTTDEAAHLFSSNETIWTSRPDQALLASMNTTTILTFVPNENFTSAFHE
                     NTTYTEYLSATTNITPLKASPEGKGTTANDATTARYTTAVSKLTSPWFANFSIVSGTT
                     SITNMPEFKLTTLLLKTIPMSTKPANELPLTPRETVVPSVDIISTLACIQPNFSTEES
                     ASETTQTEINGAIVFGGTTTPVPKSATTQRLNATVTRKEATSHYLMRKSTIAAVAEVS
                     PFSTMLEVTDESAQRVTASVTVSSFPDIEKLSTPLDNKTATTEVRESWLLTKLVKTTP
                     RSSYNEMTEMFNFNHTYVAHWTSETSEGISAGSPTSGSTHIFGEPLGASTTRISETSF
                     STTPTDRTATSLSDGILPPQPTAAHSSATPVPVTHMFSLPVNGSSVVAEETEVTMSEP
                     STLARAFSTSVLSDVSNLSSTTMTTALVPPLDQTASTTIVIVPTHGDLIRTTSEATVI
                     SVRKTSMAVPSLTETPFHSLRLSTPVTAKAETTLFSTSVDTVTPSTHTLVCSKPPPDN
                     IPPASSTHVISTTSTPEATQPISQVEETSTYALSFPYTFSGGGVVASLATGTTETSVV
                     DETTPSHISANKLTTSVNSHISSSATYRVHTPVSIQLVTSTSVLSSDKDQMTISLGKT
                     PRTMEVTEMSPSKNSFISYSRGTPSLEMTDTGFPETTKISSHQTHSPSEIPLGTPSDG
                     NLASSPTSGSTQITPTLTSSNTVGVHIPEMSTSLGKTALPSQALTITTFLCPEKESTS
                     ALPAYTPRTVEMIVNSTYVTHSVSYGQDTSFVDTTTSSSTRISNPMDINTTFSHLHSL
                     RTQPEVTSVASFISESTQTFPESLSLSTAGLYNDGFTVLSDRITTAFSVPNVPTMLPR
                     ESSMATSTPIYQMSSLPVNVTAFTSKKVSDTPPIVITKSSKTMHPGCLKSPCTATSGP
                     MSEMSSIPVNNSAFTPATVSSDTSTRVGLFSTLLSSVTPRTTMTMQTSTLDVTPVIYA
                     GATSKNKMVSSAFTTEMIEAPSRITPTTFLSPTEPTLPFVKTVPTTIMAGIVTPFVGT
                     TAFSPLSSKSTGAISSIPKTTFSPFLSATQQSSQADEATTLGILSGITNRSLSTVNSG
                     TGVALTDTYSRITVPENMLSPTHADSLHTSFNIQVSPSLTSFKSASGPTKNVKTTTNC
                     FSSNTRKMTSLLEKTSLTNYATSLNTPVSYPPWTPSSATLPSLTSFVYSPHSTEAEIS
                     TPKTSPPPTSQMVEFPVLGTRMTSSNTQPLLMTSWNIPTAEGSQFPISTTINVPTSNE
                     METETLHLVPGPLSTFTASQTGLVSKDVMAMSSIPMSGILPNHGLSENPSLSTSLRAI
                     TSTLADVKHTFEKMTTSVTPGTTLPSILSGATSGSVISKSPILTWLLSSLPSGSPPAT
                     VSNAPHVMTSSTVEVSKSTFLTSDMISAHPFTNLTTLPSATMSTILTRTIPTPTLGGI
                     TTGFPTSLPMSINVTDDIVYISTHPEASSRTTITANPRTVSHPSSFSRKTMSPSTTDH
                     TLSVGAMPLPSSTITSSWNRIPTASSPSTLIIPKPTLDSLLNIMTTTSTVPGASFPLI
                     STGVTYPFTATVSSPISSFFETTWLDSTPSFLSTEASTSPTATKSTEPGNCKADETAS
                     KYKGTYKWLLTNPTETAQTRCIKNEDGNATRFSISINTGKSQWEKPKFKQCKLLQELP
                     DKIVDLANITISDENAEDVAEHILNLINESPALGKEETKIIVSKISDISQCDEISMNL
                     THVMLQIINVVLEKQNNSASDLHEISNEILRIIERTGHKMEFSGQIANLTVAGLALAV
                     LRGDHTFDGMAFSIHSYEEGTDPETKNVTKALTTYVVSASISDDMFIQNLADPVVITL
                     QHIGGNQNYGQVHCAFWDFENNGLGGWNSSGCKVKETNVNYTICQCDHLTHFGVLMDL
                     SRSTVDSVNEQILALITYTGCGISSIFLGVAVVTYIAFKLRKDYPAKILINLCTALLM
                     LNLVFLINSWLSSFQKVGVCITAAVALHYFLLVSFTWMGLEAVHMYLALVKVFNIYIP
                     NYILKFCLVGWGIPAIMVAITVSVKKDLYGTLSPTTPCWIKDDSIFYISVVAYFCLIF
                     LMNLSMFCTVLVQLNSVKSQIQKTRRKMILHDLKGTMSLTFLLGLTWGFAFFAWGPMR
                     NFFLYLFAIFNTLQDGSSRCQIKVGYKQEGLKKIFEHKLLTPSLKSTATSSTFKSLGS
                     AQGTPSEISFPNAPELSALHVVASEPTTG"

1 atggatgaca actcaaggta ttggatggcc ttctcttata ttactaataa cgccctcctg 
61 ggcagagaag acatagacct tggacttgca ggagaccatc agcagctaat actatacaga 
121 ttgggaaaga ccttttctat ccgtcaccac ctggcttcat ttcaatggca tacaatatgc 
181 ttgatatggg atggtgtgaa gggcaaatta gaactcttcc tgaataaaga aaggatactg 
241 gaagtaacgg atcaaccaca caacctgaca cctcatggga ctctgttcct agggcacttt 
301 ctcaagaatg agagcagcga ggttaaaagc atgatgcgta gctttcctgg cagcttgtac 
361 tactttcaac tctgggacca catcctggaa aacgaagagt ttatgaagtg tttagatgga 
421 aatatagtta gttgggaaga agacgtctgg cttgtcaaca agatcatccc aactgttgac 
481 aggacactgc gctgcgttcc tgaaaatatg acaattcaag aaaaaagtac aactgtttca 
541 caacagatag atatgaccac tccatcccaa attactggag taaaaccaca aaatactgca 
601 cattcctcta cactattgtc tcaaagcata cctatatttg caactgatta cacaaccata 
661 tcatattcca atacaacatc tccacctctg gaaacaatga ctgcacaaaa aatcttaaag 
721 acactggtag atgagacagc tacatttgca gtggatgttt tatcaacttc atcagccatc 
781 tctctgccta cccagagtat atccatagac aatactacca attccatgaa aaaaacgaaa 
841 tctccatctt cagaaagcac aaagacaaca aaaatggttg aagccatggc tactgaaatc 
901 tttcaaccac ctacaccttc taatttccta tccacatcca gatttaccaa gaattcagtt 
961 gtatctacaa cttcagcaat taaatctcag tcggctgtta cgaagacaac atctttattt 
1021 tcaactattg agtcaacatc tatgtctaca acaccttgtc tcaaacaaaa atccacaaat 
1081 actggggcac tccctatctc cacagctggc caggagttca ttgaatctac agctgccgga 
1141 actgtacctt ggtttacagt ggaaaagact tcacctgcat ctactcatgt tgggactgca 
1201 tcatcattcc cacctgagcc tgtgctcatc tccacagctg ctccagtaga ttctgtattt 

1261 cctagaaacc agacagcatt tccattggca acaactgata tgaaaatagc atttacagtc 
1321 cattcattga ctctcccaac taggcttatt gagaccacac ctgccccaag gacagctgaa 
1381 acagaattga catctacaaa ttttcaggat gtctctttac ccagagtgga agatgccatg 
1441 tctacttcca tgtcgaaaga gacctcctct aagacctttt ctttcttaac atccttttca 
1501 tttactggga ctgagagtgt acagacagtt attgatgctg aagctacacg tacagcctta 
1561 actcctgaaa tcacacttgc atctacagtg gctgaaacta tgctttcctc cacaatcaca 
1621 ggacgagttt acacccagaa tacacctaca gctgatggac acttgcttac tttgatgtcc 
1681 actagatcag cttccacatc caaggcacct gagtcaggtc ccacatccac aactgatgaa 
1741 gctgcccatc tgttctccag caatgagacc atttggactt ctaggccaga ccaggccctg 
1801 ctggcatcta tgaacacaac caccatactc acatttgtgc ctaatgaaaa ttttacatca 
1861 gcatttcatg agaatactac ttatacagaa tatttatccg caactaccaa tatcacccca 
1921 ctgaaagcat ctccagaggg caaaggtacc actgccaatg atgctactac agccagatat 
1981 acaacagctg tatccaaatt gacatcacca tggtttgcta atttctccat agtttctgga 
2041 accacatcca taaccaatat gcctgaattt aaacttacca ctttactact aaaaacaata 
2101 cctatgtcta caaaacctgc aaatgaactt cctttgacac caagggagac tgttgttcca 
2161 tcagtagata taatatctac tcttgcttgc attcaaccaa atttttctac tgaggaaagt 
2221 gcttctgaga ccacacaaac agaaataaat ggtgcaattg tatttggagg tacaacgacc 
2281 cctgtaccaa agtcagcaac aacacaaaga ttaaatgcca ctgtgacaag aaaagaagca 
2341 acttcccatt atcttatgag aaaatcaact atagcagcag tggctgaggt ttctccattt 
2401 tcaacaatgc tggaagtgac agacgaatca gcacaaaggg tgacagcttc tgtcactgtt 
2461 tcctcttttc ctgatataga aaagctaagt accccattgg ataataaaac tgcaacaact 
2521 gaggtgagag aaagttggct tttgacaaaa ttggtgaaaa ccacacctag gagttcatac 
2581 aatgaaatga cagaaatgtt taattttaac cacacctatg tagcacattg gacttcagag 
2641 acatctgagg gaatttcagc tggatctccc acttctggga gcacacatat attcggtgaa 
2701 cccctgggtg cttctaccac aaggatatca gaaaccagtt tctccactac ccctacagac 
2761 aggacagcta cgtccttgtc tgatggtatc ttacctccac agcctacagc tgctcattcc 
2821 tcagcaaccc ctgtgcctgt tactcatatg ttctcattgc cagttaatgg cagttctgtg 
2881 gtggctgagg agactgaggt taccatgtct gagccttcta cactggccag ggctttttct 
2941 acatctgtgc tctcagatgt ctcaaatcta tcctcaacta caatgaccac agcattggta 
3001 ccacctttgg atcagactgc ttccacaacc attgttattg tgcctaccca tggagacttg 
3061 attcgtacca cttcagaggc cacggtaatc tctgtcagga agacatccat ggcagttcct 
3121 tctctgacag aaacaccatt tcattcactg agactctcca ctcctgtgac agctaaggct 
3181 gagaccaccc ttttctctac ctcagttgat acagtaaccc catctacaca cactcttgtc 
3241 tgctcaaaac ctccccctga caacattcct cctgcgtcct ccactcatgt gatctcaact 
3301 acgtctacac cagaagcaac tcaaccaata tctcaagtag aggagacttc tacctatgct 
3361 ctcagcttcc catatacttt cagtggtggt ggagttgttg ccagcttggc tactggcacc 
3421 acagagacct ctgttgttga tgagaccaca ccctcacaca tctctgccaa taagttgact 
3481 acttcagtaa acagtcacat ttcttcatct gccacatatc gtgtacacac accagtgtcc 
3541 atccagttgg tgactagcac ctctgtctta tcttccgaca aagaccagat gaccatatcc 
3601 ctgggaaaaa cccctagaac tatggaggtg acagaaatgt ccccatcaaa gaattctttt 
3661 atttcatact cccggggtac tccatctttg gaaatgacag atacaggatt tcctgagacc 
3721 acaaaaattt ccagtcacca aacacattcg ccttcagaga ttccacttgg gactccctct 
3781 gatggaaatt tggcttcatc tcccacttct ggaagcacac agattacacc aaccttgacc 
3841 tcaagtaaca cagtaggtgt tcacattcca gaaatgtcta ccagtcttgg gaaaacagct 
3901 ctcccctcac aagctctgac aatcaccact tttttgtgtc ctgaaaagga aagcacgagt 
3961 gcccttccag catatactcc caggactgtg gaaatgatag taaactccac ctatgtgact 
4021 cactctgtct catatggcca ggatacttca tttgtagata ccacaacttc cagctcaaca 
4081 aggatatcaa atcctatgga catcaataca actttttcac acttgcattc acttaggaca 
4141 caacctgagg tgacttcagt tgcctctttc atttctgaaa gcacacagac tttccctgag 
4201 tccttgtctc tttccacagc tggactatat aatgacggtt ttacagttct ctccgacagg 
4261 atcactacag ccttttctgt tccaaatgta cctacaatgc ttcctagaga atcctctatg 
4321 gcaacgtcca ctcctattta ccagatgtcc tcattgccag ttaatgtaac tgccttcacc 
4381 tccaaaaaag tttctgacac tcccccaata gtgataacta aatcttctaa aacaatgcat 
4441 ccaggttgtt tgaaaagtcc ctgtacagcc acttctgggc ctatgtctga gatgtcctca 
4501 ataccagtta ataactctgc tttcacacct gcaacagtct cttctgacac ttccacaaga 
4561 gttgggttat tctctacttt attgtcttca gttaccccca ggactactat gaccatgcaa 
4621 acatctacat tggatgtcac acctgtgata tatgctgggg ctacttcaaa aaacaaaatg 
4681 gtttcctctg ctttcactac agaaatgata gaggcacctt ccaggatcac acctacgacc 
4741 tttctctctc caacagagcc aactttgccc tttgtaaaaa ccgttcccac caccattatg 
4801 gctgggatag tgactccatt tgtaggcacc actgccttct ctccactcag ttctaagagc 
4861 actggagcta tttcctccat tccaaagacc acattttcac catttctatc agcaactcaa 
4921 cagtcatcac aagcagatga ggctacaact ttgggcatat tatctgggat tactaacagg 
4981 tccctatcta ctgtgaacag tggtacaggg gtagctctca cagatactta ttccagaatc 
5041 actgttcctg aaaatatgct ttcacctact catgcagata gtctccatac ttccttcaat 
5101 attcaggttt ccccatctct gactagcttt aagagtgctt ctggacccac aaaaaatgtt 
5161 aaaacaacca ccaattgctt ttcttctaat actagaaaga tgacttcctt gttagaaaag 
5221 acttccttaa caaactatgc cacatctttg aatacccctg tttcataccc tccatggacc 
5281 ccatccagtg caactctacc ctctttgaca tcatttgttt attcacctca tagtactgaa 
5341 gctgagatct ctactccaaa gacctctcct cctcccacat cccaaatggt tgaatttcca 
5401 gttctgggaa caagaatgac atctagtaat acccaacctc tgcttatgac ttcctggaac 
5461 atacccacag ctgaaggttc tcagtttcca atttccacca ctattaatgt acctacatcc 
5521 aatgagatgg aaacagagac tctacacctt gttcctgggc ctttgtcaac attcacagcc 
5581 tctcagactg gtctagtatc taaagatgtc atggcaatgt catcaattcc tatgtcagga 
5641 attcttccta accatgggct ttctgagaac ccttcattat caacatcttt aagagctatc 
5701 acttccacat tggctgacgt taagcacaca tttgagaaaa tgaccacatc tgtaactcct 
5761 gggaccacac tcccatcaat tctttctggt gccacttcag gatctgtaat ttcaaagtca 
5821 cccattctga catggctctt atctagtctc ccttctggct cccctccggc aactgtatct 
5881 aatgcccctc atgttatgac ttcctctaca gtagaggtgt caaaatcaac atttctgaca 
5941 tctgacatga tatcagcgca cccattcact aacttgacaa cactaccctc tgctactatg 
6001 agcaccatac tcacccgaac cattcctaca cctacactgg gtggtatcac tactggcttc 
6061 ccaacttctc tccctatgtc tataaatgtc acagatgaca ttgtgtacat ttccacacac 
6121 cctgaggcat cctccagaac cacaataact gccaacccca ggactgtgtc tcatccttca 
6181 tccttcagca gaaagactat gtcaccttct acaactgacc acactctatc tgttggtgcc 
6241 atgcctctgc ctagctctac aataacatct tcatggaaca gaattccaac tgcatcatca 
6301 ccctctactt taattattcc taagcccaca ctggactccc ttctaaatat aatgactact 
6361 acatccactg ttcctggagc ctcatttcca ctcatatcca ctggggtgac atatcctttt 
6421 acagcaactg tgtcttcacc aatatcgtcc ttttttgaaa caacttggct ggactccaca 
6481 ccttcctttc tatctacgga agcatcgact tcgcctactg ccaccaagtc cacagagcct 
6541 ggaaattgca aagctgatga aacagcctct aaatacaaag ggacctataa gtggctatta 
6601 accaacccta cggagacagc ccaaaccaga tgcataaaaa atgaggatgg aaatgccaca 
6661 agattctcaa tcagcatcaa cacgggcaaa tctcagtggg aaaagccaaa gtttaaacaa 
6721 tgcaaattgc ttcaagaact tcctgacaag attgtggatc ttgctaatat taccataagt 
6781 gatgagaatg ctgaggatgt tgcagagcat attttaaatt tgataaatga atccccagcc 
6841 ctgggtaaag aagagacaaa gattattgtt tctaaaatat cagatatttc acaatgtgat 
6901 gagataagta tgaacctaac tcatgttatg ttacaaataa tcaacgttgt tttggaaaag 
6961 caaaacaatt ccgcctctga tctgcatgaa ataagcaatg agattctgag gataattgag 
7021 cgtactggtc acaagatgga gttttctggg cagatagcaa atctgacggt ggccgggctg 
7081 gctttggctg tgctgcgggg ggaccacacg tttgatggca tggctttcag cattcactcc 
7141 tatgaagaag gcacagaccc tgagaccaaa aatgtcacta aagcattaac cacctatgtt 
7201 gtgagtgcca gcatttcaga tgatatgttc attcaaaact tagctgaccc agtggttatc 
7261 actctgcagc atattggagg aaaccagaat tatggtcaag ttcactgtgc cttttgggat 
7321 tttgagaata atgggctggg tggatggaat tcgtcaggct gtaaagtaaa ggaaacaaat 
7381 gtaaattaca caatctgtca gtgtgaccac ctcacccatt ttggagtctt aatggattta 
7441 tccaggtcta cagtggattc agtgaatgaa cagatattag cgcttataac atacaccgga 
7501 tgtggaatct cctccatttt tctgggagtt gcagtggtga catacatagc ttttaaactt 
7561 cgaaaagatt atcctgccaa aattctgatc aacctgtgca cagcactact gatgctaaac 
7621 ctggtatttt tgatcaattc ttggttgtca tcatttcaga aagtgggagt ttgtatcaca 
7681 gctgcagtgg cacttcatta cttcctgctt gtttctttta cttggatggg cctggaggca 
7741 gtccacatgt atttggctct agtcaaagtc ttcaacatat acattccaaa ttatatcctt 
7801 aaattttgtc tagttggttg gggaatcccg gctatcatgg tggcaatcac agtcagtgtg 
7861 aaaaaagatc tgtatggaac tctgagccca acaactccgt gttggattaa agatgattct 
7921 atcttttaca tctcagtggt ggcttatttt tgcctcatat ttctcatgaa tctctccatg 
7981 ttctgcactg ttcttgttca actgaattct gtgaaatccc aaatccagaa gactcggcgg 
8041 aagatgatcc tgcatgacct caaaggcaca atgagcctga cattcttact tggcctcacc 
8101 tgggggtttg cattttttgc ttggggaccc atgaggaact ttttcttgta tttgtttgcc 
8161 atttttaaca ctttgcaaga tgggagcagc cggtgtcaga taaaggttgg atataaacag 
8221 gagggactaa agaaaatctt tgagcacaaa ctgttgacgc catctctcaa gtcaactgca 
8281 actagctcca ctttcaaatc tttaggctct gcacaaggca caccttcaga aataagcttt 
8341 ccaaatgctc cagagctcag tgccctgcat gtggtggctt cagagcccac tactggttaa 
Novel human G protein-coupled receptors with long N-terminals 
containing GPS domains and Ser/Thr-rich regions. 

Fredriksson R, Lagerstrom MC, Hoglund PJ, Schioth HB. 

Department of Neuroscience, Uppsala University, BMC, Box 593, 751 24, 
Uppsala, Sweden. 

We report eight novel members of the superfamily of human G protein-coupled 
receptors (GPCRs) found by searches in the human genome databases, termed 
GPR97, GPR110, GPR111, GPR112, GPR113, GPR114, GPR115 and 
GPR116. Phylogenetic analysis shows that these are additional members of a 
family of GPCRs with long N-termini, previously termed EGF-7TM, LNB- 
7TM, B2 or LN-7TM. Five of the receptors form their own phylogenetic cluster, 
while three others form a cluster with the previously reported HE6 and GPR56 
(TM7XN1). All the receptors have a GPS domain in their N-terminus and long 
Ser/Thr-rich regions forming mucin-like stalks. GPR113 has a hormone binding 
domain and one EGF domain. GPR112 has over 20 Ser/Thr repeats and a 
pentraxin domain. GPR116 has two immunoglobulin-like repeats and a SEA 
box. We found several human EST sequences for most of the receptors showing 
differential expression patterns, which may indicate that some of these receptors 
participate in reproductive functions while others are more likely to have a role 
in the immune system. 



The human and mouse repertoire of the adhesion family of G- 
protein-coupled receptors. 

Bjarnadottir TK, Fredriksson R, Hoglund PJ, Gloriam DE, Lagerstrom 
MC, Schioth HB. 

Department of Neuroscience, Uppsala University, BMC, Box 593, 751 24, 
Uppsala, Sweden. 

The adhesion G-protein-coupled receptors (GPCRs) (also termed LN-7TM or 
EGF-7TM receptors) are membrane-bound proteins with long N-termini 
containing multiple domains. Here, 2 new human adhesion-GPCRs, termed 
GPR133 and GPR144, have been found by searches done in the human genome 
databases. Both GPR133 and GPR144 have a GPS domain in their N-termini, 
while GPR144 also has a pentraxin domain. The phylogenetic analyses of the 2 
new human receptors show that they group together without close relationship 
to the other adhesion-GPCRs. In addition to the human genes, mouse 
orthologues to those 2 and 15 other mouse orthologues to human were identified 
(GPR110, GPR111, GPR112, GPR113, GPR114, GPR115, GPR116, GPR123, 
GPR124, GPR125, GPR126, GPR128, LEC1, LEC2, and LEC3). Currently the 
total number of human adhesion-GPCRs is 33. The mouse and human 
sequences show a clear one-to-one relationship, with the exception of EMR2 
and EMR3, which do not seem to have orthologues in mouse. EST expression 
charts for the entire repertoire of adhesion-GPCRs in human and mouse were 
established. Over 1600 ESTs were found for these receptors, showing 
widespread distribution in both central and peripheral tissues. The expression 
patterns are highly variable between different receptors, indicating that they 
participate in a number of physiological processes. Copyright 2003 Elsevier Inc. 
EXHIBIT C 



characterize the protein. A starting material that can only be used to produce 
a final product does not have a substantial asserted utility in those instances 
where the final product is not supported by a specific and substantial utility. 
In this case none of the proteins that are to be produced as final products 
resulting from processes involving the claimed cDNA have asserted or 
identified specific and substantial utilities. The research contemplated by 
Applicants to characterize potential protein products, especially their 
biological activities, does not constitute a specific and substantial utility. 
Identifying and studying the properties of the protein itself or the 
mechanisms in which the protein is involved does not define a "real world" 
context of use. Note, because the claimed invention is not supported by a 
specific and substantial asserted utility for the reasons set forth above, 
credibility has not been assessed. Neither the specification as filed nor any 
art of record discloses or suggests any property or activity for the cDNA 
compounds such that another non-asserted utility would be well established 
for the compounds. 

Claim 1 is also rejected under 35 U.S.C. § 112, first paragraph. 
Specifically, since the claimed invention is not supported by either a specific 
and substantial asserted utility or a well established utility for the reasons set 
forth above, one skilled in the art would not know how to use the claimed 
invention. 

Example 10: DNA Fragment Encoding a Full Open Reading Frame 
(ORF) 

Specification: The specification discloses that a cDNA library was prepared 
from human kidney epithelial cells and 5000 members of this library were 
sequenced and open reading frames were identified. The specification 
discloses a Table that indicates that one member of the library having SEQ 
ID NO: 2 has a high level of homology to a DNA ligase. The specification 
teaches that this complete ORF (SEQ ID NO: 2) encodes SEQ ID NO: 3. 
An alignment of SEQ ID NO: 3 with known amino acid sequences of DNA 
ligases indicates that there is a high level of sequence conservation between 
the various known ligases. The overall level of sequence similarity between 
SEQ ID NO: 3 and the consensus sequence of the known DNA ligases that 
are presented in the specification reveals a similarity score of 95%. A search 
of the prior art confirms that SEQ ID NO: 2 has high homology to DNA 
Ligase encoding nucleic acids and that the next highest level of homology is 
to alpha-actin. However, the latter homology is only 50%. Based on the 
sequence homologies, the specification asserts that SEQ ID NO: 2 encodes a 
DNA ligase. 

Claim 1: An isolated and purified nucleic acid comprising SEQ ID NO: 2. 

Analysis: The following analysis includes the questions that need to be 
asked according to the guidelines and the answers to those questions based 
on the above facts: 

1) Based on the record, is there a "well established utility" for the 
claimed invention? Based upon applicant's disclosure and the results of the 
PTO search, there is no reason to doubt the assertion that SEQ ID NO: 2 
encodes a DNA ligase. Further, DNA ligases have a well-established use in 
the molecular biology art based on this class of protein's ability to ligate 
DNA. Consequently the answer to the question is yes. 
Note that if there is a well-established utility already associated with the 
claimed invention, the utility need not be asserted in the specification as 
filed. In order to determine whether the claimed invention has a 
well-established utility the examiner must determine that the invention has a 
specific, substantial and credible utility that would have been readily 
apparent to one of skill in the art. In this case SEQ ID NO: 2 was shown to 
encode a DNA ligase that the artisan would have recognized as having a 
specific, substantial and credible utility based on its enzymatic activity. 

Thus, the conclusion reached from this analysis is that a 35 U.S.C. § 101 
rejection and a 35 U.S.C. § 112, first paragraph, utility rejection should 
not be made. 

Example 11: Animals with Uncharacterized Human Genes 

Specification: Kidney cells from a patient with Polycystic Kidney (PCK) 
Disease have been used to make a cDNA library. From this library 8000 
nucleotide "fragments" have been sequenced but not yet used to express 
proteins in a transformed host cell nor have they been characterized in any 
other way. The 50 longest fragments, SEQ ID NO: 1-50, respectively, have 
been used to make transgenic mice. None of the 50 lines of mice have 
developed Polycystic Kidney Disease to date. The asserted utility is the use 
of the mice to research human genes from diseased human kidneys. The 
disease is inheritable, but chromosomal loci have not yet been identified. 
Neither the absence or presence of a specific protein has been identified with 
the disease condition. 



Peptide-binding G protein-coupled receptors: new opportunities 
for drug design. 

Gurrath M. 

Heinrich-Heine University, Pharmaceutical Chemistry, Universitatsstr. 1, 40225 
Dusseldorf, Germany, gurrath@pharm.uni-duesseldorf.de 

Over the last decades distinct members of the G Protein-Coupled Receptor 
(GPCR) family emerged as prominent drug targets within pharmaceutical 
research, since approximately 60 % of marketed prescription drugs act by 
selectively addressing representatives of that class of transmembrane signal 
transduction systems. It is noteworthy that the majority of GPCR-targeted drugs 
elicit their biological activity by selective agonism or antagonism of biogenic 
monoamine receptors, while the development status of peptide-binding GPCR- 
addressing compounds is still in its infancy. Exemplified on selected medicinal 
chemistry projects, this review will focus on the opportunities of therapeutic 
intervention into a broad spectrum of disease processes through agonizing or 
antagonizing the functions of peptide-binding GPCRs. In this context, a brief 
overview of GPCR-mediated signal transduction pathways will be given in 
order to emphasize the biomedical relevance of a controlled modulation of 
receptor function. Modern trends on lead finding and optimization strategies for 
peptide-binding GPCR-targeted low-molecular weight compounds will be 
highlighted on the basis of current research programs conducted in the areas of 
angiotensin II, endothelin, bradykinin, neurokinin, neuropeptide Y, LHRH, C5a 
antagonists, and somatostatin agonists, respectively. Special emphasis will be 
laid on the elaboration and utilization of structural rationales on the potential 
drug candidates, thus facilitating more detailed insights into the underlying 
molecular recognition event. 

Publication Types: 

• Review 

• Review, Tutorial 
the human genome was generated by the whole-genome shotgun sequencing
method. The 14.8-billion bp DNA sequence was generated over 9 months from
27,271,853 high-quality sequence reads (5.11-fold coverage of the genome)
from both ends of plasmid clones made from the DNA of five individuals. Two
assembly strategies, a whole-genome assembly and a regional chromosome
assembly, were used, each combining sequence data from Celera and the
publicly funded genome effort. The public data were shredded into 550-bp
segments to create a 2.9-fold coverage of those genome regions that had been
sequenced, without including biases inherent in the cloning and assembly
procedure used by the publicly funded group. This brought the effective
coverage in the assemblies to eightfold, reducing the number and size of gaps
in the final assembly over what would be obtained with 5.11-fold coverage.
The two assembly strategies yielded very similar results that largely agree
with independent mapping data. The assemblies effectively cover the
euchromatic regions of the human chromosomes. More than 90% of the genome is
in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in
scaffolds of 10 million bp or larger. Analysis of the genome sequence
revealed 26,588 protein-encoding transcripts for which there was strong
corroborating evidence and an additional ~12,000 computationally derived
genes with mouse matches or other weak supporting evidence. Although
gene-dense clusters are obvious, almost half the genes are dispersed in low
G+C sequence separated by large tracts of apparently noncoding sequence. Only
1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75%
of the genome being intergenic DNA. Duplications of segmental blocks, ranging
in size up to chromosomal lengths, are abundant throughout the genome and
reveal a complex evolutionary history. Comparative genomic analysis indicates
vertebrate expansions of genes associated with neuronal function, with
tissue-specific developmental regulation, and with the hemostasis and immune
systems. DNA sequence comparisons between the consensus sequence and publicly
funded genome data provided locations of 2.1 million single-nucleotide
polymorphisms (SNPs). A random pair of human haploid genomes differed at a
rate of 1 bp per 1250 on average, but there was marked heterogeneity in the
level of polymorphism across the genome. Less than 1% of all SNPs resulted in
variation in proteins, but the task of determining which SNPs have functional
consequences remains an open challenge.
DNA using chain-terminating nucleotide analogs (3). In the same year, the
first human gene was isolated and sequenced (4). In 1986, Hood and co-workers
(5) described an improvement in the Sanger sequencing method that included
attaching fluorescent dyes to the nucleotides, which permitted them to be
sequentially read by a computer. The first automated DNA sequencer, developed
by Applied Biosystems in California in 1987, was shown to be successful when
the sequences of two [...] with this new technology (6). From early
sequencing of human genomic regions (7), it became clear that cDNA sequences
(which are reverse-transcribed from RNA) would be [...] expressed sequence
tag (EST) method of gene identification (8), which is a random selection,
very high throughput sequencing approach to characterize cDNA libraries. The
EST method led to the rapid discovery and mapping of [...] increasing numbers
of human EST sequences necessitated the development of new [...] large
amounts of sequence data, and [...] The Institute for Genomic Research [...]
algorithm was [...] assembly and analysis of hundreds [...]
characterization [...] genome sequence was determined by a [...] was
discussed and subsequently rejected owing to the lack of appropriate software
tools for genome assembly.
neously map and sequence the human genome by means of end sequences from
150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences
spanned by known distances provide long-range continuity across the genome. A
modification of the BAC end-sequencing (BES) method was applied successfully
to complete chromosome 2 from the Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the
human genome. Their proposal was not well received (21). However, by early
1998, as less than 5% of the genome had been sequenced, it was clear that the
rate of progress in human genome sequencing worldwide was very slow (22), and
the prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated,
high-throughput capillary DNA sequencer, subsequently called the ABI PRISM
3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists
resulted in a plan to undertake the sequencing of the human genome with the
3700 DNA Analyzer and the whole-genome shotgun sequencing techniques
developed at TIGR (23). Many of the principles of operation of a
genome-sequencing facility were established in the TIGR facility (24).
However, the facility envisioned for Celera would have a capacity roughly 50
times that of TIGR, and thus new developments were required for sample
preparation and tracking and for whole-genome assembly. Some argued that the
required 150-fold scale-up from the H. influenzae genome to the human genome
with its complex repeat sequences was not feasible (25). The Drosophila
melanogaster genome was thus chosen as a test case for whole-genome assembly
on a large and complex eukaryotic genome. In collaboration with Gerald Rubin
and the Berkeley Drosophila Genome Project, the nucleotide sequence of the
120-Mbp euchromatic portion of the Drosophila genome was determined over a
1-year period (26-28). The Drosophila genome-sequencing effort resulted in
two key findings: (i) that the assembly algorithms could generate chromosome
assemblies with highly accurate order and orientation with substantially less
than 10-fold coverage, and (ii) that undertaking multiple interim assemblies
in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome
effort subsequent to the formation of Celera (29), led to a modified
whole-genome shotgun sequencing approach to the human genome. We initially
proposed to do 10-fold sequence coverage of the genome over a 3-year period
and to make interim assembled sequence data available quarterly. The
modifications included a plan to perform random shotgun sequencing to ~5-fold
coverage and to use the unordered and unoriented BAC sequence fragments and
subassemblies published in GenBank by the publicly funded genome effort (30)
to accelerate the project. We also abandoned the quarterly announcements in
the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was
consistent with a whole-genome shotgun assembly with eightfold coverage, the
human genome sequence is not as finished as the Drosophila genome was with an
effective 13-fold coverage. However, it became clear that even with this
reduced coverage strategy, Celera could generate an accurately ordered and
oriented scaffold sequence of the human genome in less than 1 year. Human
genome sequencing was initiated 8 September 1999 and completed 17 June 2000.
The first assembly was completed 25 June 2000, and the assembly reported here
was completed 1 October 2000. Here we describe the whole-genome random
shotgun sequencing effort applied to the human genome. We developed two
different assembly approaches for assembling the ~3 billion bp that make up
the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived
data were shredded to remove potential bias to the final sequence from
chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar
as a correctly and accurately assembled genome sequence with faithful order
and orientation of contigs is essential for an accurate analysis of the human
genetic code, we have devoted a considerable portion of this manuscript to
the documentation of the quality of our reconstruction of the genome. We also
describe our preliminary analysis of the human genetic code on the basis of
computational methods. Figure 1 (see fold-out chart associated with this
issue; files for each chromosome can be found in Web fig. 1 on Science Online
at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a
graphical overview of the genome and the features encoded in it. The detailed
manual curation and interpretation of the genome are just beginning.

To aid the reader in locating specific analytical sections, we have divided
the paper into seven broad sections. A summary of the major results appears
at the beginning of each section.

1 Sources of DNA and Sequencing Methods 

2 Genome Assembly Strategy and
Characterization 

3 Gene Prediction and Annotation 

4 Genome Structure 

5 Genome Evolution 

6 A Genome-Wide Examination of 
Sequence Variations 

7 An Overview of the Predicted Protein- 
Coding Genes in the Human Genome 

8 Conclusions 



1 Sources of DNA and Sequencing
Methods

Summary. This section discusses the rationale and ethical rules governing
donor selection to ensure ethnic and gender diversity along with the
methodologies for DNA extraction and library construction. The plasmid
library construction is the first critical step in shotgun sequencing. If the
DNA libraries are not uniform in size, nonchimeric, and do not randomly
represent the genome, then the subsequent steps cannot accurately reconstruct
the genome sequence. We used automated high-throughput DNA sequencing and the
computational infrastructure to enable efficient tracking of enormous amounts
of sequence information (27.3 million sequence reads; 14.9 billion bp of
sequence). Sequencing and tracking from both ends of plasmid clones from 2-,
10-, and 50-kbp libraries were essential to the computational reconstruction
of the genome. Our evidence indicates that the accurate pairing rate of end
sequences was greater than 98%.

Various policies of the United States and the World Medical Association,
specifically the Declaration of Helsinki, offer recommendations for
conducting experiments with human subjects. We convened an Institutional
Review Board (IRB) (31) that helped us establish the protocol for obtaining
and using human DNA and the informed consent process used to enroll research
volunteers for the DNA-sequencing studies reported here. We adopted several
steps and procedures to protect the privacy rights and confidentiality of the
research subjects (donors). These included a two-stage consent process, a
secure random alphanumeric coding system for specimens and records,
circumscribed contact with the subjects by researchers, and options for
off-site contact of donors. In addition, Celera applied for and received a
Certificate of Confidentiality from the Department of Health and Human
Services. This Certificate authorized Celera to protect the privacy of the
individuals who volunteered to be donors as provided in Section 301(d) of the
Public Health Service Act, 42 U.S.C. 241(d).
Celera and the IRB believed that the initial version of a completed human
genome should be a composite derived from multiple donors of diverse ethnic
backgrounds. Prospective donors were asked, on a voluntary basis, to
self-designate an ethnogeographic category (e.g., African-American, Chinese,
Hispanic, Caucasian, etc.). We enrolled [...] donors (32).

Three basic items of information from each donor were recorded and linked by
confidential code to the donated sample: age, sex, and self-designated
ethnogeographic group. From females, ~130 ml of whole, heparinized blood was
collected. From males, ~130 ml of whole, heparinized blood was collected, as
well as five specimens of semen, collected over a 6-week period. Permanent
lymphoblastoid cell lines were created by Epstein-Barr virus immortalization.
DNA from five subjects was selected for genomic DNA sequencing: two males and
three females, namely one African-American, one Asian-Chinese, one
Hispanic-Mexican, and two Caucasians (see Web fig. 2 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). The decision of whose
DNA to sequence was based on a complex mix of factors, including the goal of
achieving diversity as well as technical issues such as the quality of the
DNA libraries and availability of immortalized cell lines.

1.1 Library construction and 
sequencing 

Central to the whole-genome shotgun sequencing process is preparation of
high-quality plasmid libraries in a variety of insert sizes so that pairs of
sequence reads (mates) are obtained, one read from both ends of each plasmid
insert. High-quality libraries have an equal representation of all parts of
the genome, a small number of clones without inserts, and no contamination
from such sources as the mitochondrial genome and Escherichia coli genomic
DNA. DNA from each donor was used to construct plasmid libraries in one or
more of three size classes: 2 kbp, 10 kbp, and 50 kbp (Table 1) (33).

In designing the DNA-sequencing process, we focused on developing a simple
system that could be implemented in a robust and reproducible manner and
monitored effectively (Fig. 2) (34).
Current sequencing protocols are based on the dideoxy sequencing method (35),
which typically yields only 500 to 750 bp of sequence per reaction. This
limitation on read length has made monumental gains in throughput a
prerequisite for the analysis of large eukaryotic genomes. We accomplished
this at the Celera facility, which occupies about 30,000 square feet of
laboratory space and produces sequence data continuously at a rate of 175,000
total reads per day. The DNA-sequencing facility is supported by a
high-performance computational facility [...]. The process was modular by
design and automated. Intermodule sample backlogs allowed four principal
modules to operate independently: (i) library transformation, plating, and
colony picking; (ii) DNA template preparation; (iii) dideoxy sequencing
reaction set-up and purification; and (iv) sequence determination with the
ABI PRISM 3700 DNA Analyzer. Because the inputs and outputs of each module
have been carefully matched and sample backlogs are continuously managed,
sequencing has proceeded without a single day's interruption since the
initiation of the Drosophila project in May 1999. The ABI 3700 is a fully
automated capillary array sequencer and as such can be operated with a
minimal amount of hands-on time, currently estimated at about 15 min per day.
The capillary system also facilitates correct associations of sequencing
traces with samples through the elimination of manual sample loading and
lane-tracking errors associated with slab gels. About 65 production staff
were hired and trained, and were rotated on a regular basis through the four
production modules. A central laboratory information management system (LIMS)
tracked all sample plates by unique bar code identifiers. The facility was
supported by a quality control team that performed raw material and
in-process testing and a quality assurance group with responsibilities
including document control, validation, and auditing of the facility.
Critical to the success of the scale-up was the validation of all software
and instrumentation before implementation, and production-scale testing of
any process changes.

1.2 Trace processing 
An automated trace-processing pipeline has been developed to process each
sequence file (37). After quality and vector trimming, the average trimmed
sequence length was 543 bp, and the sequencing accuracy was exponentially
distributed with a mean of 99.5% and with less than 1 in 1000 reads being
less than 98% accurate (26). Each trimmed sequence was screened for matches
to contaminants including sequences of vector alone, E. coli genomic DNA, and
human mitochondrial DNA. The entire read for any sequence with a significant
match to a contaminant was discarded. A total of 713 reads matched E. coli
genomic DNA and 2114 reads matched the human mitochondrial genome.

1.3 Quality assessment and control
The importance of the base-pair level accuracy of the sequence data increases
as the size and repetitive nature of the genome to be sequenced increases.
Each sequence read must be placed uniquely in the genome, and even a modest
error rate can reduce the effectiveness of assembly. In addition, maintaining
the validity of mate-pair information is absolutely critical for the
algorithms described below. Procedural controls were established for
maintaining the validity of sequence mate-pairs as sequencing reactions
proceeded through the process, including strict rules built into the LIMS.
The accuracy of sequence data produced by the Celera process was validated in
the course of the Drosophila genome project (26). By collecting data for the
entire human genome in a single facility, we were able to ensure uniform
quality standards and the cost advantages associated with automation, an
economy of scale, and process consistency.

Table 1. Celera-generated data input into assembly.

                                      Number of reads for different insert libraries
                          Individual  2 kbp        10 kbp       50 kbp      Total        Total number of base pairs
No. of sequencing reads   A           0            0            2,767,357   2,767,357    1,502,674,851
                          B           11,736,757   7,467,755    66,930      19,271,442   10,464,393,006
                          C           853,819      881,290      0           1,735,109    942,164,187
                          D           952,523      1,046,815    0           1,999,338    1,085,640,534
                          F           0            1,498,607    0           1,498,607    813,743,601
                          Total       13,543,099   10,894,467   2,834,287   27,271,853   14,808,616,179

Fold sequence coverage    A           0            0            0.52        0.52
(2.9-Gb genome)           B           2.20         1.40         0.01        3.61
                          C           0.16         0.17         0           0.32
                          D           0.18         0.20         0           0.37
                          F           0            0.28         0           0.28
                          Total       2.54         2.04         0.53        5.11

Fold clone coverage       A           0            0            18.39       18.39
                          B           2.96         11.26        0.44        14.67
                          C           0.22         1.33         0           1.54
                          D           0.24         1.58         0           1.82
                          F           0            2.26         0           2.26
                          Total       3.42         16.43        18.84       38.68

Insert size* (mean)       Average     1,951 bp     10,800 bp    50,715 bp
Insert size* (SD)         Average     6.10%        8.10%        14.90%
% Mates†                  Average     74.50        [...]        [...]

2 Genome Assembly Strategy and
Characterization

Summary. We describe in this section the two approaches that we used to
assemble the genome. One method involves the computational combination of all
sequence reads with shredded data from GenBank to generate an independent,
nonbiased view of the genome. The second approach involves clustering all of
the fragments to a region or chromosome on the basis of mapping information.
The clustered data were then shredded and subjected to computational
assembly. Both approaches provided essentially the same reconstruction of
assembled DNA sequence with proper order and orientation. The second method
provided slightly greater sequence coverage (fewer gaps) and was the
principal sequence used for the analysis phase. In addition, we document the
completeness and correctness of this assembly process and provide a
comparison to the public genome sequence, which was reconstructed largely by
an independent BAC-by-BAC approach. Our assemblies effectively covered the
euchromatic regions of the human chromosomes. More than 90% of the genome was
in scaffold assemblies of 100,000 bp or greater, and 25% of the genome was in
scaffolds of 10 million bp or larger.

Fig. 2 (caption fragment): [...] defined quality guidelines. Manufacturing
pipeline processes, quality control measures, and responsible parties are
indicated and described further in the text.

Shotgun sequence assembly is a classic example of an inverse problem: given a
set of reads randomly sampled from a target sequence, reconstruct the order
and the position of those reads in the target. Genome assembly algorithms
developed for Drosophila have now been extended to assemble the ~25-fold
larger human genome. Celera assemblies consist of a set of contigs that are
ordered and oriented into scaffolds that are then mapped to chromosomal
locations by using known markers. The contigs consist of a collection of
overlapping sequence reads that provide a consensus reconstruction for a
contiguous interval of the genome. Mate pairs are a central component of the
assembly strategy. They are used to produce scaffolds in which the size of
gaps between consecutive contigs is known with reasonable precision. This is
accomplished by observing that a pair of reads, one of which is in one contig
and the other of which is in another, implies an orientation and distance
between the two contigs (Fig. 3). Finally, our assemblies did not incorporate
all reads into the final set of reported scaffolds. This set of
unincorporated reads is termed "chaff," and typically consisted of reads from
within highly repetitive regions, data from other organisms introduced
through various routes as found in many genome projects, and data of poor
quality or with untrimmed vector.
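
The way a single mate pair constrains the gap between two contigs can be
illustrated with a short sketch. It is an illustration under simplifying
assumptions, not the assembler's scaffolding code: the names PlacedRead and
estimate_gap are hypothetical, the left read is assumed to point toward the
end of its contig and its mate to point back toward the start of the next
contig, and the library's mean insert size is taken as known.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    """A read placed in a contig (hypothetical structure for illustration)."""
    contig_len: int  # length of the contig that holds the read, in bp
    start: int       # 0-based start of the read within that contig
    length: int      # read length in bp


def estimate_gap(left: PlacedRead, right: PlacedRead, insert_mean: float) -> float:
    """Estimate the gap between two contigs bridged by one mate pair.

    The cloned insert is assumed to run from the start of the left read to
    the end of the right read, so the gap is the mean insert size minus the
    insert bases that lie inside each contig.
    """
    bases_in_left_contig = left.contig_len - left.start
    bases_in_right_contig = right.start + right.length
    return insert_mean - bases_in_left_contig - bases_in_right_contig


# A 10-kbp-library mate pair (mean insert 10,800 bp, as in Table 1) with
# 1,000 bp of the insert in each contig leaves about 8,800 bp for the gap.
gap = estimate_gap(PlacedRead(5000, 4000, 543), PlacedRead(8000, 500, 500), 10_800)
print(gap)
```

A negative estimate would suggest that the two contigs overlap rather than
being separated by a gap. Several mate pairs spanning the same gap could
presumably be averaged, with the library's standard deviation indicating how
precise the estimate is.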
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2.1 Assembly data sets 
We used two independent sets of data for our assemblies. The first was a
random shotgun data set of 27.27 million reads of average length 543 bp
produced at Celera. This consisted largely of mate-pair reads from 16
libraries constructed from DNA samples taken from five different donors.
Libraries with insert sizes of 2, 10, and 50 kbp were used. By looking at how
mate pairs from a library were positioned in known sequenced stretches of the
genome, we were able to characterize the range of insert sizes in each
library and determine a mean and standard deviation. Table 1 details the
number of reads, sequencing coverage, and clone coverage achieved by the data
set. The clone coverage is the coverage of the genome in cloned DNA,
considering the entire insert of each clone that has sequence from both ends.
The clone coverage provides a measure of the amount of physical DNA coverage
of the genome. Assuming a genome size of 2.9 Gbp, the Celera trimmed
sequences gave a 5.1X coverage of the genome, and clone coverage was 3.42X,
16.40X, and 18.84X for the 2-, 10-, and 50-kbp libraries, respectively, for a
total of 38.7X clone coverage.
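
The two coverage measures defined above can be checked with a few lines of
arithmetic. The sketch below is illustrative only: the function names are
hypothetical, the inputs are the Table 1 totals, and the clone-coverage
formula (mated clones times mean insert size, divided by genome size) is an
assumption about how the reported figure was derived, since the text does not
spell out the calculation.

```python
GENOME_SIZE_BP = 2.9e9  # genome size assumed in the text


def sequence_coverage(total_bp, genome_size=GENOME_SIZE_BP):
    """Fold sequence coverage: trimmed bases sequenced per base of genome."""
    return total_bp / genome_size


def clone_coverage(n_reads, mate_fraction, mean_insert_bp, genome_size=GENOME_SIZE_BP):
    """Fold clone coverage from clones that have sequence from both ends.

    Assumption: every mated clone contributes two reads, so the number of
    mated clones is n_reads * mate_fraction / 2.
    """
    mated_clones = n_reads * mate_fraction / 2
    return mated_clones * mean_insert_bp / genome_size


# Total trimmed bases from Table 1.
print(round(sequence_coverage(14_808_616_179), 2))  # 5.11

# 2-kbp libraries: 13,543,099 reads, 74.50% mates, mean insert 1,951 bp.
print(clone_coverage(13_543_099, 0.7450, 1_951))
```

Under this assumption the 2-kbp figure comes out close to, but not exactly
at, the 3.42X reported in Table 1; the small difference presumably reflects
details of how mated clones were counted.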
The second data set was from the publicly funded Human Genome Project (PFP)
and is primarily derived from BAC clones (30). The BAC data input to the
assemblies came from a download of GenBank on 1 September 2000 (Table 2)
totaling 4443.3 Mbp of sequence. The data for each BAC is deposited at one of
four levels of completion. Phase 0 data are a set of generally unassembled
sequencing reads from a very light shotgun of the BAC, typically less than
1X. Phase 1 data are unordered assemblies of contigs, which we call BAC
contigs or bactigs. Phase 2 data are ordered assemblies of bactigs. Phase 3
data are complete BAC sequences. In the past 2 years the PFP has focused on a
product of lower quality and completeness, but on a faster time-course, by
concentrating on the production of Phase 1 data from a 3X to 4X light-shotgun
of each BAC clone.

[Fig. 3 diagram. Labels: Mapped Scaffolds; STS; Genome; Scaffold; Read pair
(mates); Contig; Gap (mean & std. dev. known); Consensus; Reads (of several
haplotypes); SNPs; BAC Fragments]

Fig. 3. Anatomy of whole-genome assembly. [...] Internally derived reads from
five different individuals [...] contig and a consensus sequence (green
line). [...] (red) by using mate pair information. Scaffolds are then mapped
to the genome (gray line) with STS (blue star) physical map information.

We screened the bactig sequences for contaminants by using the BLAST
algorithm against three data sets: (i) vector sequences in Univec core (35),
filtered for a 25-bp match at 98% sequence identity at the ends of the
sequence and a 30-bp match internal to the sequence; (ii) the nonhuman
portion of the High Throughput Genomic (HTG) Sequences division of GenBank
(39), filtered at 200 bp at 98%; and (iii) the nonredundant nucleotide
sequences from GenBank without primate and human virus entries, filtered at
200 bp at 98%. Whenever 25 bp or more of vector was found within 50 bp of the
end of a contig, the tip up to the matching vector was excised. Under these
criteria we removed 2.6 Mbp of possible contaminant and vector from the Phase
3 data, 61.0 Mbp from the Phase 1 and 2 data, and 16.1 Mbp from the Phase 0
data (Table 2). This left us with a total of 4363.7 Mbp of PFP sequence data:
20% finished, 75% rough-draft (Phase 1 and 2), and 5% single sequencing reads
(Phase 0). An additional 104,018 BAC end-sequence mate pairs were also
downloaded and included in the data sets for both assembly processes (75).

2.2 Assembly strategies
Two different approaches to assembly were pursued. The first was a
whole-genome assembly process that used Celera data and the PFP data in the
form of additional synthetic shotgun data, and the second was a
compartmentalized assembly process that first partitioned the Celera and PFP
data into sets localized to large chromosomal segments and then performed ab
initio shotgun assembly on each set. Figure 4 gives a schematic of the
overall process flow.

For the whole-genome assembly, the PFP data was first disassembled or
"shredded" into a synthetic shotgun data set of 550-bp reads that form a
perfect 2X covering of the bactigs. This resulted in 16.05 million "faux"
reads that were sufficient to cover the genome 2.96X because of redundancy in
the BAC data set, without incorporating the biases inherent in the PFP
assembly process. The combined data set of 43.32 million reads (8X), and all
associated mate-pair information, were then subjected to our whole-genome
assembly algorithm to produce a reconstruction of the genome. Neither the
location of a BAC in the genome nor its assembly of bactigs was used in this
process. Bactigs were shredded into reads because we found strong evidence
that 2.13% of them were misassembled (40). Furthermore, BAC location
information was ignored because some BACs were not correctly placed on the
PFP physical map and because we found strong evidence that at least 2.2% of
the BACs contained sequence data that were not part of the given BAC (41),
possibly as a result of sample-tracking errors

Table 2. GenBank data input into assembly.



Completion phase sequence 



Center 
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Whitehead Institute/ 
MIT Center for 
Genome Research. 
USA 



Washington University.' 
USA 



Baylor College of 
Medicine, USA 



Production Sequencing 
Facility, DOE Joint 
Genome Institute, 
USA 
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Total contaminant masked 

(bp) 

'.Average contig length (bp) 
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2,825 6,533 

243,786 138,023 

194.490,158 1,083,848,245 

1,553,597 875,618 

13,654,482 4,417,055 
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Number of contigs 
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Total contaminant' masked 

(bp) 
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798 

19 
2,127 
1,195,732 
21.604 
22.469 

562 

0 
0 
0 
0 
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8,680.214 
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2,043 
34,938 
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8.422 

1,149 
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279.477 
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--21.015 
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3360,047.574 
- 2.438.575 
16.311.664 

8.203 



1,300 
1,300 
164.214.395 
8,287 
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363 
363 

49,017.104 
4,960 
485,137 

135.033 
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754 
60,975,328 

7,274 
118,387 

80.867 

300 
300 
20,093.926 
2,371 
-27,781 
66,978 

2.599 
2;599 
246,1 18.000J 
25,054 
374,561 
94.697 

3.458 
3.458 
246.474.157 
32,136 
1.791,849 

'71.277 

9.137 
. 9.137 
835.722.268 
82.284 
' 3.365.230 
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shredded Into faux reads resulting In Z96X coverage oiu.^ 



(see below). In short, we performed a true, ab initio whole-genome assembly
in which we took the expedient of deriving additional sequence coverage, but
not mate pairs, assembled bactigs, or genome locality, from some externally
generated data.
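
The shredding of a bactig into faux reads that cover it exactly twice can be
sketched as follows. This is one simple way to obtain a perfect 2X covering
and is an assumption for illustration: the text does not say how the cut
points were chosen, and the names tile, shred, and coverage_depth are
hypothetical. Here two tilings of the same sequence are used, the second
shifted by half a read length, so reads at the ends of the second tiling are
shorter than 550 bp.

```python
def tile(seq, read_len, offset):
    """Cut seq into consecutive pieces of at most read_len bases.

    The first internal cut falls at `offset`; every base of seq ends up in
    exactly one piece. Returns a list of (start, piece) tuples.
    """
    cuts = list(range(offset, len(seq), read_len))
    if not cuts or cuts[0] != 0:
        cuts.insert(0, 0)
    cuts.append(len(seq))
    return [(a, seq[a:b]) for a, b in zip(cuts, cuts[1:]) if b > a]


def shred(seq, read_len=550):
    """Return faux reads from two offset tilings, giving 2X depth everywhere."""
    return tile(seq, read_len, 0) + tile(seq, read_len, read_len // 2)


def coverage_depth(length, reads):
    """Count how many reads cover each position of the original sequence."""
    depth = [0] * length
    for start, read in reads:
        for i in range(start, start + len(read)):
            depth[i] += 1
    return depth


bactig = "ACGT" * 500  # toy 2,000-bp sequence standing in for a bactig
faux_reads = shred(bactig)
print(len(faux_reads), set(coverage_depth(len(bactig), faux_reads)))
```

Because the faux reads carry only sequence and their own coordinates are
discarded before assembly, nothing about the original bactig ordering is
passed on, which matches the stated goal of avoiding the biases of the PFP
assembly process.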

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were
partitioned into the largest possible chromosomal segments or "components"
that could be determined with confidence, and then shotgun assembly was
applied to each partitioned subset wherein the bactig data were again
shredded into faux reads to ensure an independent ab initio assembly of the
component. By subsetting the data in this way, the overall computational
effort was reduced and the effect of interchromosomal duplications was
ameliorated. This also resulted in a reconstruction of the genome that was
relatively independent of the whole-genome assembly results so that the two
assemblies could be compared for consistency. The quality of the partitioning
into components was crucial so that different genome regions were not mixed
together. We constructed components from (i) the longest scaffolds of the
sequence from each BAC and (ii) assembled scaffolds of data unique to
Celera's data set. The BAC assemblies were obtained by a combining assembler
that used the bactigs and the 5X Celera data mapped to those bactigs as
input. This effort was undertaken as an interim step solely because the more
accurate and complete the scaffold for a given sequence stretch, the more
accurately one can tile these scaffolds into contiguous components on the
basis of sequence overlap and mate-pair information. We further visually
inspected and curated the scaffold tiling of the components to further
increase its accuracy. For the final CSA assembly, all but the partitioning
was ignored, and an independent, ab initio reconstruction of the sequence in
each component was obtained by applying our whole-genome assembly algorithm
to the partitioned, relevant Celera data and the shredded, faux reads of the
partitioned, relevant bactig data.
2.3 Whole-genome assembly
The algorithms used for whole-genome assembly (WGA) of the human genome were
enhancements to those used to produce the sequence of the Drosophila genome
reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages:
Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver,
respectively. The Screener finds and marks all microsatellite repeats with
less than a 6-bp element, and screens out all known interspersed repeat
elements, including Alu, Line, and ribosomal DNA. Marked regions get searched
for overlaps, whereas screened regions do not get searched, but can be part
of an overlap that involves unscreened matching segments.

The Overlapper compares every read against every other read in search of
complete end-to-end overlaps of at least 40 bp and with no more than 6%
differences in the match. Because all data are scrupulously vector-trimmed,
the Overlapper can insist on complete overlap matches. Computing the set of
all overlaps took roughly 10,000 CPU hours with a suite of four-processor
Alpha SMPs with 4 gigabytes of RAM. This took 4 to 5 days in elapsed time
with 40 such machines operating in parallel.

Every overlap computed above is statistically a 1-in-10^17
event and thus not a coincidental event. What makes assembly
combinatorially difficult is that while many overlaps are
actually sampled from overlapping regions of the genome, and
thus imply that the sequence reads should be assembled
together, even more overlaps are actually from two distinct
copies of a low-copy repeated element not screened above,
thus constituting an error if put together. We call the
former "true overlaps" and the latter "repeat-induced
overlaps." The assembler must avoid choosing repeat-induced
overlaps, especially early in the process.
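
The Overlapper's acceptance criteria stated above (a complete end-to-end overlap of at least 40 bp with no more than 6% differences) can be illustrated with a minimal sketch. It assumes substitution-only differences and a brute-force scan over candidate overlap lengths; the function names are hypothetical, and a production overlapper would use indexed seeds and gapped alignment rather than this quadratic comparison.

```python
def count_differences(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_suffix_prefix_overlap(read_a: str, read_b: str,
                               min_len: int = 40,
                               max_diff_rate: float = 0.06) -> int:
    """Return the longest overlap length in which a suffix of read_a
    matches a prefix of read_b within the allowed difference rate.

    Returns 0 when no overlap of at least min_len bases qualifies.
    Only substitutions are counted (an assumption of this sketch).
    """
    longest = min(len(read_a), len(read_b))
    for length in range(longest, min_len - 1, -1):
        diffs = count_differences(read_a[-length:], read_b[:length])
        if diffs <= max_diff_rate * length:
            return length
    return 0
```

A 50-bp overlap with two mismatches (4%) is accepted under these thresholds, whereas a 30-bp exact overlap is rejected for being shorter than 40 bp.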

We achieve this objective in the Unitigger. We first find all
assemblies of reads that appear to be uncontested with
respect to all other reads. We call the contigs formed from
these subassemblies unitigs (for uniquely assembled contigs).
Formally, these unitigs are the uncontested interval
subgraphs of the graph of all overlaps (42). Unfortunately,
although empirically many of these assemblies are correct
(and thus involve only true overlaps), some are in fact
collections of reads from several copies of a repetitive
element that have been overcollapsed into a single
subassembly. However, the overcollapsed unitigs are easily
identified because their average coverage depth is too high
to be consistent with the overall level of sequence coverage.
We developed a simple statistical discriminator that gives
the logarithm of the odds ratio that a unitig is composed of
unique DNA or of a repeat consisting of two or more copies.
The discriminator, set to a sufficiently stringent threshold,
identifies a subset of the unitigs that we are certain are
correct. In addition, a second, less stringent threshold
identifies a subset of remaining unitigs very likely to be
correctly assembled, of which we select those that will
consistently scaffold (see below), and thus are again almost
certain to be correct. We call the union of these two sets
U-unitigs. Empirically, we found from a 6X simulated shotgun
of human chromosome 22 that we get U-unitigs covering 98% of
the stretches of unique DNA that are >2 kbp long. We are
further able to identify the boundary of the start of a
repetitive element at the ends of a U-unitig and leverage
this so that U-unitigs span more than 93% of all singly
interspersed Alu elements and other 100- to 400-bp repetitive
segments.
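
The coverage-based discriminator is described above only in outline. As a hedged illustration, a log-odds score of this kind can be sketched under the assumption that read starts follow a Poisson process: a unique unitig of a given span is expected to contain reads at the genome-wide rate, and a two-copy collapsed repeat at twice that rate. The Poisson model, function names, and parameters are assumptions of this sketch, not the authors' published formula.

```python
import math


def poisson_log_pmf(k: int, mean: float) -> float:
    """Natural log of the Poisson probability of observing k events."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def unique_vs_repeat_log_odds(num_reads: int, span_bp: float,
                              reads_per_bp: float) -> float:
    """Log-odds that a unitig is unique rather than a two-copy repeat.

    reads_per_bp is the genome-wide read arrival rate (total reads
    divided by genome length). Positive scores favor unique DNA and
    negative scores favor an overcollapsed repeat.
    """
    expected_unique = reads_per_bp * span_bp
    return (poisson_log_pmf(num_reads, expected_unique)
            - poisson_log_pmf(num_reads, 2 * expected_unique))
```

Under this model the score simplifies to the expected read count minus the observed count times ln 2, so a unitig holding about twice as many reads as expected receives a strongly negative score.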

The result of running the Unitigger was thus a set of
correctly assembled subcontigs covering an estimated 73.6% of
the human genome. The Scaffolder then proceeded to use
mate-pair information to link these together into scaffolds.
When there are two or more mate pairs that imply that a given
pair of U-unitigs are at a certain distance and orientation
with respect to each other, the probability of this being
wrong is again roughly 1 in 10^10, assuming that mate pairs
are false less than 2% of the time. Thus, one can with high
confidence link together all U-unitigs that are linked by at
least two 2- or 10-kbp mate pairs, producing
intermediate-sized scaffolds that are then recursively linked
together by confirming 50-kbp mate pairs and BAC end
sequences. This process yielded scaffolds that are on the
order of megabase pairs in size with gaps between their
contigs that generally correspond to repetitive elements and
occasionally to small sequencing gaps. These scaffolds
reconstruct the majority of the unique sequence within a
genome.
For the Drosophila assembly, we engaged in a three-stage
repeat resolution strategy where each stage was progressively
more aggressive and thus more likely to make a mistake. For
the human assembly, we continued to use the first "Rocks"
substage where all unitigs with a good, but not definitive,
discriminator score are placed in a scaffold gap. This was
done with the condition that two or more mate pairs with one
of their reads already in the scaffold unambiguously place
the unitig in the given gap. We estimate the probability of
inserting a unitig into an incorrect gap with this strategy
to be less than 10^-7 based on a probabilistic analysis.

We revised the ensuing "Stones" substage of the human
assembly, making it more like the mechanism suggested in our
earlier work (43). For each gap, every read R that is placed
in the gap by virtue of its mated pair M being in a contig of
the scaffold and implying R's placement is collected.
Celera's mate-pairing information is correct more than 99% of
the time. Thus, almost every, but not all, of the reads in
the set belong in the gap, and when a read does not belong it
rarely agrees with the remainder of the reads. Therefore, we
simply assemble this set of reads within the gap, eliminating
any reads that conflict with the assembly. This operation
proved much more reliable than the one it replaced for the
Drosophila assembly; in the assembly of a simulated shotgun
data set of human chromosome 22, all stones were placed
correctly.

The final method of resolving gaps is to fill them with
assembled BAC data that cover the gap. We call this external
gap "walking." We did not include the very aggressive
"Pebbles" substage described in our Drosophila work, which
made enough mistakes so as to produce repeat reconstructions
for long interspersed elements whose quality was only 99.62%
correct. We decided that for the human genome it was
philosophically better not to introduce a step that was
certain to produce less than 99.99% accuracy. The cost was a
somewhat larger number of gaps of somewhat larger size.

At the final stage of the assembly process, and also at
several intermediate points, a consensus sequence of every
contig is produced. Our algorithm is driven by the principle
of maximum parsimony, with quality-value-weighted measures
for evaluating each base. The net effect is a Bayesian
estimate of the correct base to report at each position.
Consensus generation uses Celera data whenever it is present.
In the event that no Celera data cover a given region, the
BAC data sequence is used.
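
The idea of a quality-weighted, Bayesian base call at a single alignment column can be sketched as follows. This is an illustrative simplification, not the authors' consensus algorithm: it assumes Phred-scaled quality values, a uniform prior over the four bases, independent reads, and errors spread evenly over the three alternative bases. The function name is hypothetical.

```python
import math


def consensus_base(observations):
    """Pick the most probable base for one alignment column.

    observations is a list of (base, phred_quality) tuples. Each
    candidate base is scored by the log-likelihood of all observed
    calls given that the candidate is the true base.
    """
    log_likelihood = {}
    for candidate in "ACGT":
        total = 0.0
        for base, quality in observations:
            # Cap the error probability at 0.75, the rate of a random guess.
            p_err = min(10 ** (-quality / 10), 0.75)
            if base == candidate:
                total += math.log(1 - p_err)
            else:
                total += math.log(p_err / 3)
        log_likelihood[candidate] = total
    # With a uniform prior the posterior maximum equals the likelihood maximum.
    return max(log_likelihood, key=log_likelihood.get)
```

With this weighting, a single high-quality call can outweigh two low-quality calls that disagree with it, which is the behavior a simple majority vote would miss.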

A key element of achieving a WGA of the human genome was to
parallelize the Overlapper and the central consensus
sequence-constructing subroutines. In addition, memory was a
real issue: a straightforward application of the software we
had built for Drosophila would have required a computer with
a 600-gigabyte RAM. By making the Overlapper and Unitigger
incremental, we were able to achieve the same computation
with a maximum of instantaneous usage of 28 gigabytes of RAM.
Moreover, the incremental nature of the first three stages
allowed us to continually update the state of this part of
the computation as data were delivered and then perform a
7-day run to complete Scaffolding and Repeat Resolution
whenever desired. For our assembly operations, the total
compute infrastructure consists of 10 four-processor SMPs
with 4 gigabytes of memory per cluster (Compaq's ES40,
Regatta) and a 16-processor NUMA machine with 64 gigabytes of
memory (Compaq's GS160, Wildfire). The total compute for a
run of the assembler was roughly 20,000 CPU hours.

The assembly of Celera's data, together with the shredded
bactig data, produced a set of scaffolds totaling 2.848 Gbp
in span and consisting of 2.586 Gbp of sequence. The chaff,
or set of reads not incorporated in the assembly, numbered
1U7 million (26%), which is consistent with our experience
for Drosophila. More than 84% of the genome was covered by
scaffolds >100 kbp long, and these averaged 91% sequence and
9% gaps with a total of 2.297 Gbp of sequence. There were a
total of 93,857 gaps among the 1637 scaffolds >100 kbp. The
average scaffold size was 1.5 Mbp, the average contig size
was 24.06 kbp, and the average gap size was 2.43 kbp, where
the distribution of each was essentially exponential. More
than 50% of all gaps were less than 500 bp long, >62% of all
gaps were less than 1 kbp long, and no gap was >100 kbp long.
Similarly, more than 65% of the sequence is in contigs >30
kbp, more than 31% is in contigs >100 kbp, and the largest
contig was 1.22 Mbp long. Table 3 gives detailed summary
statistics for the structure of this assembly with a direct
comparison to the compartmentalized shotgun assembly.


2.4 Compartmentalized shotgun 
assembly 

In addition to the WGA approach, we pursued a localized
assembly approach that was intended to subdivide the genome
into segments, each of which could be shotgun assembled
individually. We expected that this would help in resolution
of large interchromosomal duplications and improve the
statistics for calculating U-unitigs. The compartmentalized
assembly process involved clustering Celera reads and bactigs
into large, multiple megabase regions of the genome, and then
running the WGA assembler on the Celera data and shredded,
faux reads obtained from the bactig data.

The first phase of the CSA strategy was to separate Celera
reads into those that matched the BAC contigs for a
particular PFP BAC entry, and those that did not match any
public data. Such matches must be guaranteed to



Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                                        Scaffold size
                                     All            >30 kbp        >100 kbp       >500 kbp       …

Compartmentalized shotgun assembly
No. of bp in scaffolds               2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,68…
  (including intrascaffold gaps)
No. of bp in contigs                 2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,52…
No. of scaffolds                     53,591         2,845          1,935          1,060          …
No. of contigs                       170,033        112,207        107,199        93,138         …
No. of gaps                          116,442        109,362        105,264        92,078         …
No. of gaps ≤1 kbp                   72,091         69,175         67,289         59,915         …
Average scaffold size (bp)           54,217         966,219        1,395,602      2,348,450      3,11…
Average contig size (bp)             15,609         22,496         23,242         24,916         …
Average intrascaffold gap size (bp)  2,161          2,054          1,985          1,832          …
Largest contig (bp)                  1,988,321      1,988,321      1,988,321      1,988,321      1,98…
% of total contigs                   100            95             94             87             …

Whole-genome assembly
No. of bp in scaffolds               2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,94…
  (including intrascaffold gaps)
No. of bp in contigs                 2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,30…
No. of scaffolds                     118,968        2,507          1,637          818            …
No. of contigs                       221,036        99,189         95,494         84,641         …
No. of gaps                          102,068        96,682         93,857         83,823         …
No. of gaps ≤1 kbp                   62,356         60,343         59,156         54,079         …
Average scaffold size (bp)           23,938         1,027,041      1,542,660      2,846,620      3,86…
Average contig size (bp)             11,702         23,534         24,061         25,319         …
Average intrascaffold gap size (bp)  2,560          2,487          2,426          2,213          …
Largest contig (bp)                  1,224,073      1,224,073      1,224,073      1,224,073      1,22…
% of total contigs                   100            90             89             83             …

properly place a Celera read, so all reads were first masked
against a library of common repetitive elements, and only
matches of at least 40 bp to unmasked portions of the read
constituted a hit. Of Celera's 27.27 million reads, 20.76
million matched a bactig and another 0.62 million reads,
which did not have any matches, were nonetheless identified
as belonging in the region of the bactig's BAC because their
mate matched the bactig. Of the remaining reads, 2.92 million
were completely screened out and so could not be matched, but
the other 2.97 million reads had unmasked sequence totaling
1.189 Gbp that were not found in the GenBank data set.
Because the Celera data are 5.11X redundant, we estimate that
240 Mbp of unique Celera sequence is not in the GenBank data
set.
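
The read accounting in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the helper names are hypothetical. Dividing the unmatched unmasked sequence by the stated redundancy gives roughly 233 Mbp, which is close to the reported estimate of 240 Mbp; the text does not say how the reported figure was rounded or adjusted.

```python
def total_reads_millions(matched, mate_placed, fully_screened, unmatched):
    """Sum the four read categories (all values in millions of reads)."""
    return matched + mate_placed + fully_screened + unmatched


def unique_sequence_mbp(unmasked_total_mbp, redundancy):
    """Estimate unique sequence by dividing raw bases by fold redundancy."""
    return unmasked_total_mbp / redundancy


# Figures quoted in the paragraph above.
reads = total_reads_millions(20.76, 0.62, 2.92, 2.97)   # about 27.27 million
estimate = unique_sequence_mbp(1189.0, 5.11)            # about 232.7 Mbp
```

The four categories sum to the 27.27 million reads stated for the full Celera data set, which supports reading the two unmatched groups (2.92 and 2.97 million) as the 5.89 million fragments assembled separately later in the section.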

In the next step of the CSA process, a combining assembler
took the relevant 5X Celera reads and bactigs for a BAC
entry, and produced an assembly of the combined data for that
locale. These high-quality sequence reconstructions were a
transient result whose utility was simply to provide more
reliable information for the purposes of their tiling into
sets of overlapping and adjacent scaffold sequences in the
next step. In outline, the combining assembler first examines
the set of matching Celera reads to determine if there are
excessive pileups indicative of unscreened repetitive
elements. Wherever these occur, reads in the repeat region
whose mates have not been mapped to consistent positions are
removed. Then all sets of mate pairs that consistently imply
the same relative position of two bactigs are bundled into a
link and weighted according to the number of mates in the
bundle. A "greedy" strategy then attempts to order the
bactigs by selecting bundles of mate-pairs in order of their
weight. A selected mate-pair bundle can tie together two
formative scaffolds. It is incorporated to form a single
scaffold only if it is consistent with the majority of links
between contigs of the scaffold. Once scaffolding is
complete, gaps are filled by the "Stones" strategy described
above for the WGA assembler.
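
The greedy bundle-selection step can be sketched as follows, assuming each bundle is already summarized as a pair of bactig names and a weight (the number of supporting mate pairs). The union-find bookkeeping and the caller-supplied `is_consistent` predicate are assumptions of this sketch; the majority-of-links consistency test described above is represented only by that hypothetical hook.

```python
def greedy_scaffold(bactigs, bundles, is_consistent=lambda a, b: True):
    """Join bactigs into scaffold groups by descending bundle weight.

    bundles is a list of (bactig_a, bactig_b, weight) tuples. A bundle
    is accepted only if it joins two different groups and passes the
    consistency check. Returns (groups, accepted_bundles).
    """
    parent = {name: name for name in bactigs}

    def find(name):
        # Follow parent links to the group representative, compressing paths.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    accepted = []
    for a, b, weight in sorted(bundles, key=lambda item: -item[2]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b or not is_consistent(a, b):
            continue
        parent[root_b] = root_a
        accepted.append((a, b, weight))

    groups = {}
    for name in bactigs:
        groups.setdefault(find(name), set()).add(name)
    return [frozenset(members) for members in groups.values()], accepted
```

Processing the heaviest bundles first means the best-supported joins are fixed before weaker evidence is considered, which is the usual rationale for a greedy ordering of this kind.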

The GenBank data for the Phase 1 and 2 BACs consisted of an
average of 19.8 bactigs per BAC of average size 8099 bp.
Application of the combining assembler resulted in individual
Celera BAC assemblies being put together into an average of
1.83 scaffolds (median of 1 scaffold) consisting of an
average of 8.57 contigs of average size 18,973 bp. In
addition to defining order and orientation of the sequence
fragments, there were 57% fewer gaps in the combined result.
For Phase 0 data, the average GenBank entry consisted of
91.52 reads of average length 784 bp. Application of the
combining assembler resulted in an average of 54.8 scaffolds
consisting of an average of 58.1 contigs of average size 873
bp. Basically, some small amount of assembly took place, but
not enough Celera data were matched to truly assemble the
0.5X to 1X data set represented by the typical Phase 0 BACs.
The combining assembler was also applied to the Phase 3 BACs
for SNP identification, confirmation of assembly, and
localization of the Celera reads. The Phase 0 data suggest
that a combined whole-genome shotgun data set and 1X
light-shotgun of BACs will not yield good assembly of BAC
regions; at least 3X light-shotgun of each BAC is needed.

The 5.89 million Celera fragments not matching the GenBank
data were assembled with our whole-genome assembler. The
assembly resulted in a set of scaffolds totaling 442 Mbp in
span and consisting of 326 Mbp of sequence. More than 20% of
the scaffolds were >5 kbp long, and these averaged 63%
sequence and 27% gaps with a total of 302 Mbp of sequence.
All scaffolds >5 kbp were forwarded along with all scaffolds
produced by the combining assembler to the subsequent tiling
phase.

At this stage, we typically had one or two scaffolds for
every BAC region constituting at least 95% of the relevant
sequence, and a collection of disjoint Celera-unique
scaffolds. The next step in developing the genome components
was to determine the order and overlap tiling of these BAC
and Celera-unique scaffolds across the genome. For this, we
used Celera's 50-kbp mate-pairs information, and BAC-end
pairs (18) and sequence tagged site (STS) markers (44) to
provide long-range guidance and chromosome separation. Given
the relatively manageable number of scaffolds, we chose not
to produce this tiling in a fully automated manner, but to
compute an initial tiling with a good heuristic and then use
human curators to resolve discrepancies or missed join
opportunities. To this end, we developed a graphical user
interface that displayed the graph of tiling overlaps and the
evidence for each. A human curator could then explore the
implication of mapped STS data, dot-plots of sequence
overlap, and a visual display of the mate-pair evidence
supporting a given choice. The result of this process was a
collection of "components," where each component was a tiled
set of BAC and Celera-unique scaffolds that had been
curator-approved. The process resulted in 3845 components
with an estimated span of 2.922 Gbp.

In order to generate the final CSA, we assembled each
component with the WGA algorithm. As was done in the WGA
process, the bactig data were shredded into a synthetic 2X
shotgun data set in order to give the assembler the freedom
to independently assemble the data. By using faux reads
rather than bactigs, the assembly algorithm could correct
errors in the assembly of bactigs and remove chimeric content
in a PFP data entry. Chimeric or contaminating sequence (from
another part of the genome) would not be incorporated into
the reassembly of the component because it did not belong
there. In effect, the previous steps in the CSA process
served only to bring together Celera fragments and PFP data
relevant to a large contiguous segment of the genome, wherein
we applied the assembler used for WGA to produce an ab
initio assembly of the region.
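
Shredding a bactig into a synthetic 2X set of faux reads can be sketched as evenly spaced, overlapping windows. The read length of 550 bases and the even spacing are assumptions of this sketch (the text specifies only the 2X depth), and the function name is hypothetical.

```python
def shred(sequence: str, read_len: int = 550, coverage: float = 2.0):
    """Cut a sequence into overlapping faux reads at a target coverage.

    Read starts are spaced read_len / coverage bases apart, and a final
    read is added if needed so that the end of the sequence is covered.
    Sequences no longer than one read are returned whole.
    """
    if len(sequence) <= read_len:
        return [sequence]
    step = max(1, int(read_len / coverage))
    last_start = len(sequence) - read_len
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s:s + read_len] for s in starts]
```

Because the faux reads carry no record of how the bactig was originally joined, the assembler is free to place them independently, which is what lets it correct or discard misassembled or chimeric portions.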

WGA assembly of the components resulted in a set of scaffolds
totaling 2.906 Gbp in span and consisting of 2.654 Gbp of
sequence. The chaff, or set of reads not incorporated into
the assembly, numbered 6.17 million, or 22%. More than 90.0%
of the genome was covered by scaffolds spanning >100 kbp, and
these averaged 92.2% sequence and 7.8% gaps with a total of
2.492 Gbp of sequence. There were a total of 105,264 gaps
among the 107,199 contigs that belong to the 1940 scaffolds
spanning >100 kbp. The average scaffold size was 1.4 Mbp, the
average contig size was 23.24 kbp, and the average gap size
was 2.0 kbp, where each distribution of sizes was
exponential. As such, averages tend to be underrepresentative
of the majority of the data. Figure 5 shows a histogram of
the bases in scaffolds of various size ranges. Consider also
that more than 49% of all gaps were <500 bp long, more than
62% of all gaps were <1 kbp, and all gaps are <100 kbp long.
Similarly, more than 73% of the sequence is in contigs >30
kbp, more than 49% is in contigs >100 kbp, and the largest
contig was 1.99 Mbp long. Table 3 provides summary statistics
for the structure of this assembly with a direct comparison
to the WGA assembly.



2.5 Comparison of the WGA and CSA 
scaffolds 

Having obtained two assemblies of the human genome via
independent computational processes (WGA and CSA), we
compared scaffolds from the two assemblies as another means
of investigating their completeness, consistency, and
contiguity. From each assembly, a set of reference scaffolds
containing at least 1000 fragments (Celera sequencing reads
or bactig shreds) was obtained; this amounted to 2218 WGA
scaffolds and 1717 CSA scaffolds, for a total of 2.087 Gbp
and 2.474 Gbp. The sequence of each reference scaffold was
compared to the sequence of all scaffolds from the other
assembly with which it shared at least 20 fragments or at
least 20% of the fragments of the smaller scaffold. For each
such comparison, all matches of at least 200 bp with at most
2% mismatch were tabulated.

From this tabulation, we estimated the amount of unique
sequence in each assembly in two ways. The first was to
determine the number of bases of each assembly that were not
covered by a matching segment in the other assembly. Some
82.5 Mbp of the WGA (3.95%) was not covered by the CSA,
whereas 204.5 Mbp (8.26%) of the CSA was not covered by the
WGA. This estimate did not require any consistency of the
assemblies or any uniqueness of the matching segments. Thus,
another analysis was conducted in which matches of less than
1 kbp between a pair of scaffolds were excluded unless they
were confirmed by other matches having a consistent order and
orientation. This gives some measure of consistent coverage:
1.982 Gbp (95.00%) of the WGA is covered by the CSA, and
2.169 Gbp (87.69%) of the CSA is covered by the WGA by this
more stringent measure.

The comparison of WGA to CSA also permitted evaluation of
scaffolds for structural inconsistencies. We looked for
instances in which a large section of a scaffold from one
assembly matched only one scaffold from the other assembly,
but failed to match over the full length of the overlap
implied by the matching segments. An initial set of
candidates was identified automatically, and then each
candidate was inspected by hand. From this process, we
identified 31 instances in which the assemblies appear to
disagree in a nonlocal fashion. These cases are being further
evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or
orientation. The following results exclude cases in which one
contig in one assembly corresponds to more than one
overlapping contig in the other assembly (as long as the
order and orientation of the latter agrees with the positions
they match in the former). Most of these small rearrangements
involved segments on the order of hundreds of base pairs and
rarely >1 kbp. We found a total of 295 kbp (0.012%) in the
CSA assemblies that were locally inconsistent with the WGA
assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly
were inconsistent with the CSA assembly.
The CSA assembly was a few percentage points better in terms
of coverage and slightly more consistent than the WGA,
because it was in effect performing a few thousand shotgun
assemblies of megabase-sized problems, whereas the WGA is
performing a shotgun assembly of a gigabase-sized problem.
When one considers the increase of two-and-a-half orders of
magnitude in problem size, the information loss between the
two is remarkably small. Because CSA was logistically easier
to deliver and the better of the two results available at the
time when downstream analyses needed to be begun, all
subsequent analysis was performed on this assembly.

2.6 Mapping scaffolds to the genome 
The final step in assembling the genome was to order and
orient the scaffolds on the chromosomes. We first grouped
scaffolds together on the basis of their order in the
components from CSA. These grouped scaffolds were reordered
by examining residual mate-pairing data between the
scaffolds. We next mapped the scaffold groups onto the
chromosome using physical mapping data. This step depends on
having reliable high-resolution map information such that
each scaffold will overlap multiple markers. There are two
genome-wide types of map information available: high-density
STS maps and fingerprint maps of BAC clones developed at
Washington University (45). Among the genome-wide STS maps,
GeneMap99 (GM99) has the most markers and therefore was most
useful for mapping scaffolds. The two different mapping
approaches are complementary to one another. The fingerprint
maps should have better local order because they were built
by comparison of overlapping BAC clones. On the other hand,
GM99 should have a more reliable long-range order, because
the framework markers were derived from well-validated
genetic maps. Both types of maps were used as a reference for
human curation of the components that were the input to the
regional assembly, but they did not determine the order of
sequences produced by the assembler.
Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total
sequence is indicated.



In order to determine the effectiveness of the fingerprint
maps and GM99 for mapping scaffolds, we first examined the
reliability of these maps by comparison with large scaffolds.
Only 1% of the STS markers on the 10 largest scaffolds (those
>9 Mbp) were mapped on a different chromosome on GM99. Two
percent of the STS markers disagreed in position by more than
five framework bins. However, for the fingerprint maps, a 2%
chromosome discrepancy was observed, and on average 23.8% of
BAC locations in the scaffold sequence disagreed with
fingerprint map placement by more than five BACs. When
further examining the source of discrepancy, it was found
that most of the discrepancy came from 4 of the 10 scaffolds,
indicating that there is variation in the quality of either
the map or the scaffolds. All four scaffolds were assembled
as well as the other six, as judged by clone coverage
analysis, and showed the same low discrepancy rate to GM99,
and thus we concluded that the fingerprint map global order
in these cases was not reliable. Smaller scaffolds had a
higher discordance rate with GM99 (4.21% of STSs were
discordant by more than five framework bins), but a lower
discordance rate with the fingerprint maps (11% of BACs
disagreed with fingerprint maps by more than five BACs). This
observation agrees with the clone coverage analysis (46) that
Celera scaffold construction was better supported by
long-range mate pairs in larger scaffolds than in small
scaffolds.

We created two orderings of Celera scaffolds on the basis of
the markers (BAC or STS) on these maps. Where the order of
scaffolds agreed between GM99 and the WashU BAC map, we had a
high degree of confidence that that order was correct; these
scaffolds were termed "anchor scaffolds." Only scaffolds with
a low overall discrepancy rate with both maps were considered
anchor scaffolds. Scaffolds in GM99 bins were allowed to
permute in their order to match WashU ordering, provided they
did not violate their framework orders. Orientation of
individual scaffolds was determined by the presence of
multiple mapped markers with consistent order. Scaffolds with
only one marker have insufficient information to assign
orientation. We found 70.1% of the genome in anchored
scaffolds, more than 99% of which are also oriented (Table
4). Because GM99 is of lower resolution than the WashU map, a
number of scaffolds without STS matches could be ordered
relative to the anchored scaffolds because they included
sequence from the same or adjacent BACs on the WashU map. On
the other hand, because of occasional WashU global ordering
discrepancies, a number of scaffolds determined to be
"unmappable" on the WashU map could be ordered relative to
the anchored scaffolds with GM99. These scaffolds were termed
"ordered scaffolds." We found that 13.9% of the assembly
could be ordered by these additional methods, and thus 84.0%
of the genome was ordered unambiguously.

Next, all scaffolds that could be placed, but not ordered,
between anchors were assigned to the interval between the
anchored scaffolds and were deemed to be "bounded" between
them. For example, small scaffolds having STS hits from the
same GeneMap bin or hitting the same BAC cannot be ordered
relative to each other, but can be assigned a placement
boundary relative to other anchored or ordered scaffolds. The
remaining scaffolds either had no localization information,
conflicting information, or could only be assigned to a
generic chromosome location. Using the above approaches, ~98%
of the genome was anchored, ordered, or bounded.

Finally, we assigned a location for each scaffold placed on
the chromosome by spreading out the scaffolds per chromosome.
We assumed that the remaining unmapped scaffolds,
constituting 2% of the genome, were distributed evenly across
the genome. By dividing the sum of unmapped scaffold lengths
with the sum of the number of mapped scaffolds, we arrived at
an estimate of interscaffold gap of 1483 bp. This gap was
used to separate all the scaffolds on each chromosome and to
assign an offset in the chromosome.
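
The gap estimate and offset assignment described above amount to one division followed by a running sum. The sketch below uses hypothetical function names and made-up scaffold lengths purely for illustration; it assumes integer division and that the first scaffold on each chromosome starts at offset 0, neither of which is stated in the text.

```python
def interscaffold_gap(unmapped_lengths, num_mapped_scaffolds):
    """Spread the total unmapped sequence evenly over the mapped scaffolds."""
    return sum(unmapped_lengths) // num_mapped_scaffolds


def assign_offsets(ordered_lengths, gap):
    """Return the chromosome offset of each scaffold in map order.

    Each scaffold starts where the previous one ended plus one
    fixed-size interscaffold gap.
    """
    offsets = []
    position = 0
    for length in ordered_lengths:
        offsets.append(position)
        position += length + gap
    return offsets
```

For example, three unmapped scaffolds totaling 6000 bp spread over four mapped scaffolds give a 1500-bp gap, and scaffolds of 100, 200, and 50 bp separated by a 10-bp gap receive offsets 0, 110, and 320.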

During the scaffold-mapping effort, we encountered many
problems that resulted in additional quality assessment and
validation analysis. At least 978 (3% of 33,173) BACs were
believed to have sequence data from more than one location in
the genome (47). This is consistent with the bactig chimerism
analysis reported above in the Assembly Strategies section.
These BACs could not be assigned to unique positions within
the CSA assembly and thus could not be used for ordering
scaffolds. Likewise, it was not always possible to assign
STSs to unique locations in the assembly because of genome
duplications, repetitive elements, and pseudogenes.

Because of the time required for an exhaustive search for a
perfect overlap, CSA generated 21,607 intrascaffold gaps
where the mate-pair data suggested that the contigs should
overlap, but no overlap was found. These gaps were defined as
a fixed 50 bp in length and make up 18.6% of the total
116,442 gaps in the CSA assembly.

We chose not to use the order of exons implied in cDNA or EST
data as a way of ordering scaffolds. The rationale for not
using this data was that doing so would have biased certain
regions of the assembly by rearranging scaffolds to fit the
transcript data and made validation of both the assembly and
gene definition processes more difficult.



2.7 Assembly and validation analysis
We analyzed the assembly of the genome from the perspectives
of completeness (amount of coverage of the genome) and
correctness (the structural accuracy of the order and
orientation and the consensus sequence of the assembly).

Completeness. Completeness is defined as the percentage of
the euchromatic sequence represented in the assembly. This
cannot be known with absolute certainty until the
euchromatin sequence has been completed. However, it is
possible to estimate completeness on the basis of (i) the
estimated sizes of intrascaffold gaps; (ii) coverage of the
two published chromosomes, 21 and 22 (48, 49); and (iii)
analysis of the percentage of an independent set of random
sequences (STS markers) contained in the assembly. The
whole-genome libraries contain heterochromatic sequence and,
although no attempt has been made to assemble it, there may
be instances of unique sequence embedded in regions of
heterochromatin as were observed in Drosophila (50, 51).

The sequences of human chromosomes 21 and 22 have been
completed to high quality and published (48, 49). Although
this sequence served as input to the assembler, the finished
sequence was shredded into a shotgun data set so that the
assembler had the opportunity to assemble it differently from
the original sequence in the case of structural polymorphisms
or assembly errors in the BAC data. In particular, the
assembler must be able to resolve repetitive elements at the
scale of components (generally multimegabase in size), and so
this comparison reveals the level to which the assembler
resolves repeats. In certain areas, the assembly structure
differs from the published versions of chromosomes 21 and 22
(see below). The consequence of the flexibility to assemble
"finished" sequence differently on the basis of Celera data
resulted in an assembly with more segments than the
chromosome 21 and 22 sequences. We examined the reasons why
there are more gaps in the Celera sequence than in
chromosomes 21 and 22 and expect that they may be typical of
gaps in other regions of the genome. In the Celera assembly,
there are 25 scaffolds, each containing at least 10 kb of
sequence, that collectively span 94.3% of chromosome 21.
Sixty-two scaffolds span 95.7% of chromosome 22. The total
length of the gaps remaining in the Celera assembly for these
two chromosomes is 3.4 Mbp. These gap sequences were analyzed
by RepeatMasker and by searching against the entire genome
assembly (52). About 50% of the gap sequence consisted of
common repetitive elements identified by RepeatMasker; more
than half of the remainder was lower copy number repeat
elements.
A more global way of assessing completeness is to measure the content of an
independent set of sequence data in the assembly. We compared 48,938 STS
markers from Genemap99 (51) to the scaffolds. Because these markers were not
used in the assembly processes, they provided a truly independent measure of
completeness. ePCR (53) and BLAST (54) were used to locate STSs on the
assembled genome. We found 44,524 (91%) of the STSs in the mapped genome. An
additional 2648 markers (5.4%) were found by searching the unassembled data or
"chaff." We identified 1283 STS markers (2.6%) not found in either Celera
sequence or BAC data as of September 2000, raising the possibility that these
markers may not be of human origin. If that were the case, the Celera assembled
sequence would represent 93.4% of the human genome and the unassembled data
5.5%, for a total of 98.9% coverage. Similarly, we compared CSA against 36,678
TNG radiation hybrid markers (55a) using the same method. We found that 32,371
markers (88%) were located in the mapped CSA scaffolds, with 2055 markers
(5.6%) found in the remainder. This gave a 94% coverage of the genome through
another genome-wide survey.
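
The completeness figures quoted above follow from simple ratios. The sketch
below, written in Python purely for illustration, recomputes them from the
marker counts in the text. The Genemap99 total is taken as 48,938 (the value
consistent with the quoted 91%), and the function name is hypothetical. The
results agree with the quoted percentages to within rounding.

```python
# Marker counts quoted in the text for the Genemap99 STS comparison.
TOTAL_STS = 48_938
FOUND_IN_ASSEMBLY = 44_524
FOUND_IN_CHAFF = 2_648
NOT_FOUND = 1_283  # markers absent from both Celera and BAC data


def coverage_fractions(total, in_assembly, in_chaff, excluded=0):
    """Return (assembly, chaff, combined) fractions after dropping `excluded` markers."""
    denominator = total - excluded
    assembly = in_assembly / denominator
    chaff = in_chaff / denominator
    return assembly, chaff, assembly + chaff


# All markers counted, then with the unlocated markers treated as non-human.
raw = coverage_fractions(TOTAL_STS, FOUND_IN_ASSEMBLY, FOUND_IN_CHAFF)
adjusted = coverage_fractions(TOTAL_STS, FOUND_IN_ASSEMBLY, FOUND_IN_CHAFF,
                              excluded=NOT_FOUND)

for label, (assembly, chaff, combined) in (("raw", raw), ("adjusted", adjusted)):
    print(f"{label}: assembly {assembly:.1%}, chaff {chaff:.1%}, combined {combined:.1%}")
```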

Correctness. Correctness is defined as the structural and sequence accuracy of
the assembly. Because the source sequences for the Celera data and the GenBank
data are from different individuals, we could not directly compare the
consensus sequence of the assembly against other finished sequence for
determining sequencing accuracy at the nucleotide level, although this has been
done for identifying polymorphisms as described in Section 6. The accuracy of
the consensus sequence is at least 99.96% on the basis of a statistical
estimate derived from the quality values of the underlying reads.

Table 4. Summary of scaffold mapping. Scaffolds were mapped to the genome with
different levels of confidence (anchored scaffolds have the highest confidence;
unmapped scaffolds have the lowest). Anchored scaffolds were consistently
ordered by the WashU BAC map and GM99. Ordered scaffolds were consistently
ordered by at least one of the following: the WashU BAC map, GM99, or component
tiling path. Bounded scaffolds had order conflicts between at least two of the
external maps, but their placements were adjacent to a neighboring anchored or
ordered scaffold. Unmapped scaffolds had, at most, a chromosome assignment. The
scaffold subcategories are given below each category.

Mapped scaffold category     Number      Length (bp)   % of total length
Anchored                      1,526    1,860,676,676                  70
  Oriented                    1,246    1,852,088,645                  70
  Unoriented                    280        8,588,031                 0.3
Ordered                       2,001      369,235,857                  14
  Oriented                      839      329,633,166                  12
  Unoriented                  1,162       39,602,691                   2
Bounded                      38,241      368,753,463                  14
  Oriented                    7,453      274,536,424                  10
  Unoriented                 30,788       94,217,039                   4
Unmapped                     11,823       55,313,737                   2
  Known chromosome              281        2,505,844                 0.1
  Unknown chromosome         11,542       52,807,893                   2

The structural consistency of the assembly can be measured by mate-pair
analysis. In a correct assembly, every mated pair of sequencing reads should be
located on the consensus sequence with the correct separation and orientation
between the pairs. A pair is termed "valid" when the reads are in the correct
orientation and the distance between them is within the mean ± 3 standard
deviations of the distribution of insert sizes of the library from which the
pair was sampled. A pair is termed "misoriented" when the reads are not
correctly oriented, and is termed "misseparated" when the distance between the
reads is not in the correct range but the reads are correctly oriented. The
mean ± the standard deviation of each library used by the assembler was
determined as described above. To validate these, we examined all reads mapped
to the finished sequence of chromosome 21 (48) and determined how many
incorrect mate pairs there were as a result of laboratory tracking errors and
chimerism (two different segments of the genome cloned into the same plasmid),
and how tight the distribution of insert sizes was for those that were correct
(Table 5). The standard deviations for all Celera libraries were quite small,
less than 15% of the insert length, with the exception of a few 50-kbp
libraries. The 2- and 10-kbp libraries contained less than 2% invalid mate
pairs, whereas the 50-kbp libraries were somewhat higher (~10%). Thus, although
the mate-pair information was not perfect, its accuracy was such that measuring
valid, misoriented, and misseparated pairs with respect to a given assembly was
deemed to be a reliable instrument for validation purposes, especially when
several mate pairs confirm or deny an ordering.

The clone coverage of the genome was 39X, meaning that any given base pair was,
on average, contained in 39 clones or, equivalently, spanned by 39 mate-paired
reads. Areas of low clone coverage or areas with a high proportion of invalid
mate pairs would indicate potential assembly problems. We computed the coverage
of each base in the assembly by valid mate pairs (Table 6). In summary, for
scaffolds >30 kbp in length, less than 1% of the Celera assembly was in regions
of less than 3X clone coverage. Thus, more than 99% of the assembly, including
order and orientation, is strongly supported by this measure alone.
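
The valid / misoriented / misseparated classification defined above can be
sketched in a few lines. This is an illustrative Python example only, not the
assembler's code: the `Library` container, the function name, and the example
library statistics (of the order of those reported for a 2-kbp library) are
assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Library:
    """Insert-size statistics for one clone library (hypothetical container)."""
    mean_insert: float  # mean insert size in bp
    sd_insert: float    # standard deviation of insert size in bp


def classify_mate_pair(separation, correctly_oriented, library, n_sd=3.0):
    """Return 'valid', 'misoriented', or 'misseparated' for one mate pair."""
    # Orientation is checked first: a wrongly oriented pair is misoriented
    # regardless of the distance between its reads.
    if not correctly_oriented:
        return "misoriented"
    low = library.mean_insert - n_sd * library.sd_insert
    high = library.mean_insert + n_sd * library.sd_insert
    return "valid" if low <= separation <= high else "misseparated"


# Illustrative 2-kbp library: mean 2,082 bp, SD 90 bp,
# so the accepted separation range is 1,812 to 2,352 bp.
lib_2kbp = Library(mean_insert=2082, sd_insert=90)
print(classify_mate_pair(2100, True, lib_2kbp))
print(classify_mate_pair(2500, True, lib_2kbp))
print(classify_mate_pair(2100, False, lib_2kbp))
```

Counting valid pairs that span each base would then give a per-base coverage
figure of the kind summarized in Table 6.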

We examined the locations and number of all misoriented and misseparated mates.
In addition to doing this analysis on the CSA assembly (as of 1 October 2000),
we also performed a study of the PFP assembly as of 5 September 2000 (SO, 55b).
In this latter case, Celera mate pairs had to be mapped to the PFP assembly. To
avoid mapping errors due to high-fidelity repeats, the only pairs mapped were
those for which both reads matched at only one location with less than 6%
differences. A threshold was set such that sets of five or more simultaneously
invalid mate pairs indicated a potential breakpoint, where the construction of
the two assemblies differed. The graphic comparison of the CSA chromosome 21
assembly with the published sequence (Fig. 6A) serves as a validation of this
methodology. Blue tick marks in the panels indicate breakpoints. There were a
similar (small) number of breakpoints on both chromosome sequences. The
exception was 12 sets of scaffolds in the Celera assembly (a total of 3% of the
chromosome length in 212 single-contig scaffolds) that were mapped to the wrong
positions because they were too small to be mapped reliably. Figures 6 and 7
and Table 6 illustrate the mate-pair differences and breakpoints between the
two assemblies. There was a higher percentage of misoriented and misseparated
mate pairs in the large-insert libraries (50 kbp and BAC ends) than in the
small-insert libraries in both assemblies (Table 6). The large-insert libraries
are more likely to identify discrepancies simply because they span a larger
segment of the genome. The graphic comparison between the two assemblies for
chromosome 8 (Fig. 6, B and C) shows that there are many



Table 5. Mate-pair validation. Celera fragment sequences were mapped to the
published sequence of chromosome 21. Each mate pair uniquely mapped was
evaluated for correct orientation and placement (number of mate pairs tested).
If the two mates had incorrect relative orientation or placement, they were
considered invalid (number of invalid mate pairs).

                   -------------------- Chromosome 21 ---------------------   ---------- Genome ----------
Library  Library Mean insert      SD  SD/mean  Mate pairs  Invalid  Invalid   Mean insert      SD  SD/mean
type     no.       size (bp)    (bp)      (%)      tested    pairs      (%)     size (bp)    (bp)      (%)
2 kbp    1             2,0??     106      5.1       3,642       38      1.0         2,082      90      4.3
         2             1,913     152      7.9      28,029      413      1.5         1,923     118      6.1
         3             2,166     175      8.1       4,405       57      1.3         2,162     158      7.3
10 kbp   4            11,385     851      7.5       4,319       80      1.9        11,370     696      6.1
         5            14,523   1,875     12.9       7,355      156      2.1        14,142   1,402      9.9
         6             9,635   1,035     10.7       5,573      109      2.0         9,606     934      9.7
         7            10,223     928      9.1      34,079      399      1.2        10,190     777      7.6
50 kbp   8            64,888   2,747      4.2          16        1      6.3        65,500   5,504      8.4
         9            53,410   5,834     10.9         914      170     18.6        53,311   5,546     10.4
         10           52,034   7,312     14.1       5,871      569      9.7        51,498   6,588     12.8
         11           52,282   7,454     14.3       2,629      213      8.1        52,282   7,454     14.3
         12           46,616   7,378     15.8       2,153      215     10.0        45,418   9,068     20.0
         13           55,788  10,099     18.1       2,244      249     11.1        53,062  10,893     20.5
         14           39,894   5,019     12.6         199        7      3.5        36,838   9,988     27.1
BES      15           48,931   9,813     20.1         144       10      6.9        47,845   4,774     10.0
         16           48,130   4,232      8.8         195       14      7.2        47,924   4,581      9.6
         17          106,027  27,778     26.2         330       16      4.8       152,000  26,600     17.5
         18          160,575  54,973     34.2         155        8      5.2       161,750  27,000     16.7
         19          164,155  19,453     11.9         642       44      6.9       176,500  19,500    11.05
Sum                                               102,894    2,768      2.7 (mean)
gene boundaries. During this process, multiple hits to the same region were
collapsed to a coherent set of data by tracking the coverage of a region. For
example, if a group of bases was represented by multiple overlapping ESTs, the
union of these regions matched by the set of ESTs on the scaffold was marked as
being supported by EST evidence. This resulted in a series of "gene bins," each
of which was believed to contain a single gene. One weakness of this initial
implementation of the algorithm was in predicting gene boundaries in regions of
tandemly duplicated genes. Gene clusters frequently resulted in homologous
neighboring genes being joined together, resulting in an annotation that
artificially concatenated these gene models.
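
Collapsing overlapping hits into the union of the regions they cover is a
standard interval-merging step. The Python sketch below illustrates only that
step; the function name and the half-open coordinate convention are
assumptions, and the text does not specify how merged regions were
subsequently grouped into gene bins.

```python
def merge_intervals(hits):
    """Collapse overlapping or abutting [start, end) hits into their union."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # The hit overlaps (or touches) the previous region: extend it.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Three EST hits on a scaffold; the first two overlap and are merged.
est_hits = [(200, 400), (100, 250), (900, 1000)]
print(merge_intervals(est_hits))
```

Merging hits in this way also shows why tandemly duplicated genes are a
problem: evidence that bridges two neighboring homologous genes fuses them
into one region.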

Next, known genes (those with exact matches of a full-length cDNA sequence to
the genome) were identified, and the region corresponding to the cDNA was
annotated as a predicted transcript. A subset of the curated human gene set
RefSeq from the National Center for Biotechnology Information (NCBI) was
included as a data set searched in the computational pipeline. If a RefSeq
transcript matched the genome assembly for at least 50% of its length at >92%
identity, then the SIM4 (63) alignment of the RefSeq transcript to the region
of the genome under analysis was promoted to the status of an Otto annotation.
Because the genome sequence has gaps and sequence errors such as frameshifts,
it was not always possible to predict a transcript that agrees precisely with
the experimentally determined cDNA sequence. A total of 6538 genes in our
inventory were identified and transcripts predicted in this way.
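
The promotion rule for RefSeq matches reduces to two thresholds. The following
Python sketch encodes them under stated assumptions: the function and
parameter names are hypothetical, and the text does not say how the matched
length or the identity were measured, so both are taken here as precomputed
inputs.

```python
def passes_refseq_filter(aligned_bases, transcript_length, percent_identity,
                         min_fraction=0.50, min_identity=92.0):
    """Apply the 'at least 50% of its length at >92% identity' rule."""
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    fraction_matched = aligned_bases / transcript_length
    return fraction_matched >= min_fraction and percent_identity > min_identity


# A 2,000-base transcript with 1,400 bases aligned at 97.5% identity passes;
# the same alignment at 90% identity does not.
print(passes_refseq_filter(1400, 2000, 97.5))
print(passes_refseq_filter(1400, 2000, 90.0))
```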

Fig. 6. Comparison of the CSA and the PFP assembly. (A) All of chromosome 21,
(B) all of chromosome 8, and (C) a 1-Mb region of chromosome 8 representing a
single Celera scaffold. To generate the figure, Celera fragment sequences were
mapped onto each assembly. The PFP assembly is indicated in the upper third of
each panel; the Celera assembly is indicated in the lower third. In the center
of the panel, green lines show Celera sequences that are in the same order and
orientation in both assemblies and form the longest consistently ordered run of
sequences. Yellow lines indicate sequence blocks that are in the same
orientation, but out of order. Red lines indicate sequence blocks that are not
in the same orientation. For clarity, in the latter two cases, lines are only
drawn between segments of matching sequence that are at least 50 kbp long. The
top and bottom thirds of each panel show the extent of Celera mate-pair
violations (red, misoriented; yellow, incorrect distance between the mates) for
each assembly grouped by library size. (Mate pairs that are within the correct
distance, as expected from the mean library insert size, are omitted from the
figure for clarity.) Predicted breakpoints, corresponding to stacks of violated
mate pairs of the same type, are shown as blue ticks on each assembly axis.
Runs of more than 10,000 Ns are shown as cyan bars. Plots of all 24 chromosomes
can be seen in Web fig. 3 on Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.

Fig. 7. Schematic view of the distribution of breakpoints and large gaps on all
chromosomes. For each chromosome, the upper pair of lines represent the PFP
assembly, and the lower pair of lines represent Celera's assembly. Blue tick
marks represent breakpoints, whereas red tick marks represent a gap of larger
than 10,000 bp. The number of breakpoints per chromosome is indicated in black,
and the chromosome numbers in red.

Regions that have a substantial amount of sequence similarity, but do not match
known genes, were analyzed by that part of the Otto system that uses the
sequence similarity information to predict a transcript. Here, Otto evaluates
evidence generated by the computational pipeline, corresponding to conservation
between mouse and human genomic DNA, similarity to human transcripts (ESTs and
cDNAs), similarity to rodent transcripts (ESTs and cDNAs), and similarity of
the translation of human genomic DNA to known proteins to predict potential
genes in the human genome. The sequence from the region of genomic DNA
contained in a gene bin was extracted, and the subsequences supported by any
homology evidence were marked (plus 100 bases flanking these regions). The
other bases in the region, those not covered by any homology evidence, were
replaced by N's. This sequence segment, with high confidence regions
represented by the consensus genomic sequence and the remainder represented by
N's, was then evaluated by Genscan to see if a consistent gene model could be
generated. This procedure simplified the gene-prediction task by first
establishing the boundary for the gene (not a strength of most gene-finding
algorithms), and by eliminating regions with no supporting evidence. If Genscan
returned a plausible gene model, it was further evaluated before being promoted
to an "Otto" annotation. The final Genscan predictions were often quite
different from the prediction that Genscan returned on the same region of
native genomic sequence. A weakness of using Genscan to refine the gene model
is the loss of valid, small exons from the final annotation.
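
The masking step described above, in which bases without homology support are
replaced by N's while supported regions and their 100-base flanks are kept,
can be illustrated as follows. This is a minimal Python sketch, not the Otto
implementation; the function name and the half-open coordinates are
assumptions.

```python
def mask_unsupported(sequence, evidence, flank=100):
    """Replace every base outside the flank-padded evidence regions with 'N'."""
    keep = [False] * len(sequence)
    for start, end in evidence:  # half-open [start, end) coordinates
        padded_start = max(0, start - flank)
        padded_end = min(len(sequence), end + flank)
        for position in range(padded_start, padded_end):
            keep[position] = True
    return "".join(base if kept else "N" for base, kept in zip(sequence, keep))


# Toy example with a 2-base flank so the effect is visible on a short sequence.
toy = "ACGT" * 5
print(mask_unsupported(toy, [(8, 10)], flank=2))
```

The masked string is what a gene finder would then be run on, which is why
small exons lacking any homology evidence cannot survive this step.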

Table 7. Sensitivity and specificity of Otto and Genscan. Sensitivity and
specificity were calculated by first aligning the prediction to the published
RefSeq transcript, tallying the number (N) of uniquely aligned RefSeq bases.
Sensitivity is the ratio of N to the length of the published RefSeq transcript.
Specificity is the ratio of N to the length of the prediction. All differences
are significant (Tukey HSD; P < 0.001).

Method                  Sensitivity   Specificity
Otto (RefSeq only)*           0.939         0.973
Otto (homology)†              0.604         0.884
Genscan                       0.501         0.633

*Refers to those annotations produced by Otto using only the Sim4-polished
RefSeq alignment rather than an evidence-based Genscan prediction. †Refers to
those annotations produced by supplying all available evidence to Genscan.

The next step in defining gene structures based on sequence similarity was to
compare each predicted transcript with the homology-based evidence that was
used in previous steps to evaluate the depth of evidence for each exon in the
prediction. Internal exons were considered to be supported if they were covered
by homology evidence to within ±10 bases of their edges. For first and last
exons, the internal edge was required to be within 10 bases, but the external
edge was allowed greater latitude to allow for 5' and 3' untranslated regions
(UTRs). To be retained, a prediction for a multi-exon gene must have evidence
such that the total number of "hits," as defined above, divided by the number
of exons in the prediction must be >0.66 or must correspond to a RefSeq
sequence. A single-exon gene must be covered by at least three supporting hits
(±10 bases on each side), and these must cover the complete predicted open
reading frame. For a single-exon gene, we also required that the Genscan
prediction include both a start and a stop codon. Gene models that did not meet
these criteria were disregarded, and those that passed were promoted to Otto
predictions. Homology-based Otto predictions do not contain 3' and 5'
untranslated sequence. Although three de novo gene-finding programs [GRAIL,
Genscan, and FgenesH (63)] were run as part of the computational analysis, the
results of these programs were not directly used in making the Otto
predictions. Otto predicted 11,226 additional genes by means of sequence
similarity.
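
The retention criteria for homology-based predictions can be summarized as a
small decision function. The Python sketch below is an interpretation, not the
Otto code: it assumes that the number of supporting hits (counted with the
±10-base edge tolerance) has already been tallied upstream, and all names are
hypothetical.

```python
def retain_prediction(n_exons, n_hits, *, matches_refseq=False,
                      orf_fully_covered=False, has_start_and_stop=False):
    """Decide whether a predicted gene model meets the stated evidence criteria."""
    if n_exons < 1:
        raise ValueError("a prediction needs at least one exon")
    if n_exons == 1:
        # Single-exon genes: at least three hits covering the whole open
        # reading frame, plus a Genscan model with start and stop codons.
        return n_hits >= 3 and orf_fully_covered and has_start_and_stop
    # Multi-exon genes: hits per exon above 0.66, or a RefSeq correspondence.
    return matches_refseq or (n_hits / n_exons) > 0.66


print(retain_prediction(3, 2))                       # 0.667 hits per exon
print(retain_prediction(4, 2))                       # 0.5 hits per exon
print(retain_prediction(4, 2, matches_refseq=True))  # rescued by RefSeq
```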

3.2 Otto validation 

To validate the Otto homology-based process and the method that Otto uses to
define the structures of known genes, we compared transcripts predicted by Otto
with their corresponding (and presumably correct) transcript from a set of 4512
RefSeq transcripts for which there was a unique SIM4 alignment (Table 7). In
order to evaluate the relative performance of Otto and Genscan, we made three
comparisons. The first involved a determination of the accuracy of gene models
predicted by Otto with only homology data other than the corresponding RefSeq
sequence (Otto homology in Table 7). We measured the sensitivity (correctly
predicted bases divided by the total length of the cDNA) and specificity
(correctly predicted bases divided by the sum of the correctly and incorrectly
predicted bases). Second, we examined the sensitivity and specificity of the
Otto predictions that were made solely with the RefSeq sequence, which is the
process that Otto uses to annotate known genes (Otto-RefSeq). And third, we
determined the accuracy of the Genscan predictions corresponding to these
RefSeq sequences. As expected, the alignment method (Otto-RefSeq) was the most
accurate, and Otto-homology performed better than Genscan by both criteria.
Thus, 6.1% of true RefSeq nucleotides were not represented in the Otto-RefSeq
annotations and 2.7% of the nucleotides in the Otto-RefSeq transcripts were not
contained in the original RefSeq transcripts. The discrepancies could come from
legitimate differences between the Celera assembly and the RefSeq transcript
due to polymorphisms, incomplete or incorrect data in the Celera assembly,
errors introduced by Sim4 during the alignment process, or the presence of
alternatively spliced forms in the data set used for the comparisons.
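
The two measures used in this comparison are simple ratios of the number of
correctly predicted bases. The Python sketch below states them explicitly; the
function name and the example lengths are illustrative assumptions, chosen
only so that the output resembles the Otto-RefSeq row of Table 7.

```python
def sensitivity_specificity(n_aligned, refseq_length, prediction_length):
    """Return (sensitivity, specificity) for one predicted transcript.

    n_aligned is the number of predicted bases that align uniquely to the
    RefSeq transcript; the two lengths are in bases.
    """
    sensitivity = n_aligned / refseq_length      # share of true bases recovered
    specificity = n_aligned / prediction_length  # share of predicted bases that are true
    return sensitivity, specificity


# Hypothetical transcript: 1,000 RefSeq bases, 965 predicted bases, 939 shared.
sens, spec = sensitivity_specificity(939, 1000, 965)
print(f"sensitivity {sens:.3f}, specificity {spec:.3f}")
print(f"true bases missed {1 - sens:.1%}, predicted bases not in RefSeq {1 - spec:.1%}")
```

The complements of the two ratios are the quantities quoted in the text:
a sensitivity of 0.939 leaves 6.1% of RefSeq nucleotides unrepresented, and a
specificity of 0.973 leaves 2.7% of predicted nucleotides outside RefSeq.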

Because Otto uses an evidence-based approach to reconstruct genes, the absence
of experimental evidence for intervening exons may inadvertently result in a
set of exons that cannot be spliced together to give rise to a transcript. In
such cases, Otto may "split genes" when in fact all the evidence should be
combined into a single transcript. We also examined the tendency of these
methods to incorrectly split gene predictions. These trends are shown in Fig.
8. Both RefSeq and homology-based predictions by Otto split known genes into
fewer segments than Genscan alone.



3.3 Gene number 

Recognizing that the Otto system is quite conservative, we used a different
gene-prediction strategy in regions where the homology evidence was less
strong. Here the results of de novo gene predictions were used. For these
genes, we insisted that a predicted transcript have at least two of the
following types of evidence to be included in the gene set for further
analysis: protein, human EST, rodent EST, or mouse genome fragment matches.
This final class of predicted genes is a subset of the predictions made by the
three gene-finding programs that were used in the computational pipeline. For
these, there was not sufficient sequence similarity information for Otto to
attempt to predict a gene structure. The three de novo gene-finding programs
resulted in about 155,695 predictions, of which ~76,410 were nonredundant
(nonoverlapping with one another). Of these, 57,935 did not overlap known genes
or predictions made by Otto. Only 21,350 of the gene predictions that did not
overlap Otto predictions were partially supported by at least one type of
sequence similarity evidence, and 8619 were partially supported by two types of
evidence (Table 8).

The sum of this number (21,350) and the number of Otto annotations (17,764),
39,114, is near the upper limit for the human gene complement. As seen in Table
8, if the requirement for other supporting evidence is made more stringent,
this number drops rapidly so that demanding two types of evidence reduces the
total gene number to 26,383 and demanding three types reduces it to ~23,000.
Requiring that a prediction be supported by all four categories of evidence is
too stringent because it would eliminate genes that encode novel proteins
(members of currently undescribed protein families). No correction for
pseudogenes has been made at this point in the analysis.

In a further attempt to identify genes that were not found by the
autoannotation process or any of the de novo gene finders, we examined regions
outside of gene predictions that were similar to the EST sequence, and where
the EST matched the genomic sequence across a splice junction. After correcting
for potential 3' UTRs of predicted genes, about 2500 such regions remained.
Addition of a requirement for at least one of the following evidence types
(homology to mouse genomic sequence fragments, rodent ESTs, or cDNAs) or
similarity to a known protein reduced this number to 1010. Adding this to the
numbers from the previous paragraph would give us estimates of about 40,000,
27,000, and 24,000 potential genes in the human genome, depending on the
stringency of evidence considered. Table 8 illustrates the number of genes and
presents the degree of confidence based on the supporting evidence. Transcripts
encoded by a set of 26,383 genes were assembled for further analysis. This set
includes the 6538 genes predicted by Otto on the basis of matches to known
genes, 11,226 transcripts predicted by Otto based on homology evidence, and
8619 from the subset of transcripts from de novo gene-prediction programs that
have two types of supporting evidence. The 26,383 genes are illustrated along
chromosome diagrams in Fig. 1. These are a very preliminary set of annotations
and are subject to all the limitations of an automated process. Considerable
refinement is still necessary to improve the accuracy of these transcript
predictions. All the predictions and descriptions of genes and the associated
evidence that we present are the product of completely computational
processes, not expert curation. We have attempted to enumerate the genes in the
human genome in such a way that we have different levels of confidence based on
the amount of supporting evidence: known genes, genes with good protein or EST
homology evidence, and de novo gene predictions confirmed by modest homology
evidence.
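
The gene-number estimates at each level of stringency are sums of the counts
quoted above. The Python sketch below reproduces that arithmetic; the variable
and function names are hypothetical, and the count of de novo predictions
supported by three types of evidence (4,947) is taken from Table 8.

```python
OTTO_KNOWN = 6_538          # genes matched to known (RefSeq) transcripts
OTTO_HOMOLOGY = 11_226      # genes predicted by Otto from homology evidence
DE_NOVO_SUPPORTED = {1: 21_350, 2: 8_619, 3: 4_947}  # by minimum evidence types
EST_SPLICE_REGIONS = 1_010  # extra regions with spliced EST support

OTTO_TOTAL = OTTO_KNOWN + OTTO_HOMOLOGY


def gene_estimate(min_evidence_types, include_est_regions=False):
    """Otto annotations plus de novo predictions at the given stringency."""
    total = OTTO_TOTAL + DE_NOVO_SUPPORTED[min_evidence_types]
    if include_est_regions:
        total += EST_SPLICE_REGIONS
    return total


for evidence_types in (1, 2, 3):
    print(evidence_types,
          gene_estimate(evidence_types),
          gene_estimate(evidence_types, include_est_regions=True))
```

Rounded to the nearest thousand, the totals that include the EST-supported
regions correspond to the estimates of about 40,000, 27,000, and 24,000 genes.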

3.4 Features of human gene 
transcripts 

We estimate the average span for a "typical" gene in the human DNA sequence to
be about 27,894 bases. This is based on the average span covered by RefSeq
transcripts, used because it represents our highest confidence set.

The set of transcripts promoted to gene annotations varies in a number of ways.
As can be seen from Table 8 and Fig. 9, transcripts predicted by Otto tend to
be longer, having on average about 7.8 exons, whereas those promoted from
gene-prediction programs average about 3.7 exons. The largest number of exons
that we have identified in a transcript is 234 in the titin mRNA. Table 8
compares the amounts of evidence that support the Otto and other predicted
transcripts. For example, one can see that a typical Otto transcript has 6.99
of its 7.81 exons supported by protein homology evidence. As would be expected,
the Otto transcripts generally have more support than do transcripts predicted
by the de novo methods.

4 Genome Structure 

Summary. This section describes several of the noncoding attributes of the
assembled genome sequence and their correlations with the predicted gene set.
These include an analysis of G+C content and gene density in the context of
cytogenetic maps of the genome, an enumerative analysis of CpG islands, and a
brief description of the genome-wide repetitive elements.



4.1 Cytogenetic maps 

Perhaps the most obvious, and certainly the most visible, element of the
structure of the genome is the banding pattern produced by Giemsa stain.
Chromosomal banding studies have revealed that about 17% to 20% of the human
chromosome complement consists of C-bands, or constitutive heterochromatin
(64). Much of this heterochromatin is highly polymorphic and consists of
different families of alpha satellite DNAs with various higher order repeat
structures (65). Many chromosomes have complex inter- and intrachromosomal
duplications present in pericentromeric regions (65). About 5% of the sequence
reads were identified as alpha satellite sequences; these were not included in
the assembly.
Fig. 8. Analysis of split genes resulting from different annotation methods. A
set of 4512 Sim4-based alignments of RefSeq transcripts to the genomic assembly
were chosen (see the text for criteria), and the numbers of overlapping
Genscan, Otto (RefSeq only) annotations based solely on Sim4-polished RefSeq
alignments, and Otto (homology) annotations (annotations produced by supplying
all available evidence to Genscan) were tallied. These data show the degree to
which multiple Genscan predictions and/or Otto annotations were associated with
a single RefSeq transcript. The zero class for the Otto-homology predictions
shown here indicates that the Otto-homology calls were made without recourse to
the RefSeq transcript, and thus no Otto call was made because of insufficient
evidence.



Table 8. Numbers of exons and transcripts supported by various types of
evidence for Otto and de novo gene prediction methods. Highlighted cells
indicate the gene sets analyzed in this paper (boldface, set of genes selected
for protein analysis; italic, total set of accepted de novo predictions).

                                           ------- Types of evidence -------   --- No. of lines of evidence* ---
Method    Quantity                  Total    Mouse   Rodent  Protein    Human        ≥1       ≥2       ≥3       ≥4
Otto      Number of transcripts    17,969   17,065   14,881   15,477   16,374   17,968†   17,501   15,877   12,451
          Number of exons         141,218  111,174   89,569  108,431  118,869   140,710  127,955   99,574   59,804
De novo   Number of transcripts    58,032   14,463    5,094    8,043    9,220    21,350    8,619    4,947    1,904
          Number of exons         319,935   48,594   19,344   26,264   40,104    79,148   31,130   17,508    6,520

No. of exons per transcript
Otto                                 7.84     5.77     6.01     6.99     7.24      7.81     7.19     6.00     4.28
De novo                              5.53     3.17     3.80     3.27     4.36       3.7     3.56     3.42     3.16

*Four kinds of evidence (conservation in 3X mouse genomic DNA, similarity to
human EST or cDNA, similarity to rodent EST or cDNA, and similarity to known
proteins) were considered to support gene predictions from the different
methods. The use of evidence is quite liberal, requiring only a partial match
to a single exon of predicted transcript. †This number includes alternative
splice forms of the 17,764 genes mentioned elsewhere in the text.
Examination of pericentromeric regions is ongoing.

The remaining ~80% of the genome, the euchromatic component, is divisible into
G-, R-, and T-bands (67). These cytogenetic bands have been presumed to differ
in their nucleotide composition and gene density, although we have been unable
to determine precise band boundaries at the molecular level. T-bands are the
most G+C- and gene-rich, and G-bands are G+C-poor (68). Bernardi has also
offered a description of the euchromatin at the molecular level as long
stretches of DNA of differing base composition, termed isochores (denoted L,
H1, H2, and H3), which are >300 kbp in length (69). Bernardi defined the L
(light) isochores as G+C-poor (<43%), whereas the H (heavy) isochores fall into
three G+C-rich classes representing 24, 8, and 5% of the genome. Gene
concentration has been claimed to be very low in the L isochores and 20-fold
more enriched in the H2 and H3 isochores (70). By examining contiguous 50-kbp
windows of G+C content across the assembly, we found that regions of G+C
content >48% (H3 isochores) averaged 273.9 kbp in length, those with G+C
content between 43 and 48% (H1+H2 isochores) averaged 202.8 kbp in length, and
the average span of regions with <43% (L isochores) was 1078.6 kbp. The
correlation between G+C content and gene density was also examined in 50-kbp
windows along the assembled sequence (Table 9 and Figs. 10 and 11). We found
that the density of genes was greater in regions of high G+C than in regions of
low G+C content, as expected. However, the correlation between G+C content and
gene density was not as skewed as previously predicted (69). A higher
proportion of genes were located in the G+C-poor regions than had been
expected.
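
The windowed G+C analysis described above can be sketched as follows. This is
an illustrative Python example, not the analysis code used for the assembly:
the function names are hypothetical, the class boundaries are the 43% and 48%
thresholds quoted in the text, and ignoring N's when computing G+C content is
an assumption.

```python
from itertools import groupby


def gc_fraction(window):
    """G+C fraction of a window, ignoring non-ACGT characters; None if empty."""
    window = window.upper()
    counted = sum(window.count(base) for base in "ACGT")
    if counted == 0:
        return None
    return (window.count("G") + window.count("C")) / counted


def isochore_class(gc):
    """Assign a window to L (<43%), H1+H2 (43 to 48%), or H3 (>48%)."""
    if gc < 0.43:
        return "L"
    if gc <= 0.48:
        return "H1+H2"
    return "H3"


def classify_windows(sequence, window_size=50_000):
    """Yield (start, G+C fraction, class) for contiguous, non-overlapping windows."""
    for start in range(0, len(sequence), window_size):
        gc = gc_fraction(sequence[start:start + window_size])
        if gc is not None:
            yield start, gc, isochore_class(gc)


def run_lengths(classes, window_size=50_000):
    """Merge consecutive windows of the same class into (class, length) runs."""
    return [(label, sum(1 for _ in group) * window_size)
            for label, group in groupby(classes)]


# Toy sequence with 10-base windows: two AT-only windows, then one GC-only window.
toy = "AT" * 10 + "GC" * 5
windows = list(classify_windows(toy, window_size=10))
print(windows)
print(run_lengths([label for _, _, label in windows], window_size=10))
```

Averaging the run lengths within each class would give figures comparable in
kind to the 273.9, 202.8, and 1078.6 kbp spans reported above.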

Chromosomes 17, 19, and 22, which have a disproportionate number of
H3-containing bands, had the highest gene density (Table 10). Conversely, the
chromosomes that we found to have the lowest gene density (X, 4, 18, 13, and Y)
also have the fewest H3 bands. Chromosome 15, which also has few H3 bands, did
not have a particularly low gene density in our analysis. In addition,
chromosome 8, which we found to have a low gene density, does not appear to be
unusual in its H3 banding.

How valid is Ohno's postulate (71) that 
mammalian genomes consist of oases of genes 
in otherwise essentially empty deserts? It ap- 
pears that the human genome does indeed con- 
tain deserts, or large, gene-poor regions. If we 
define a desert as a region >500 kbp without a 
gene, then we see that 605 Mbp, or about 20% 
of the genome, is in deserts. These are not 
uniformly distributed over the various chromo- 
somes. Gene-rich chromosomes 17, 19, and 22 
have only about 12% of their collective 171 
Mbp in deserts, whereas gene-poor chromo- 
somes 4, 13, 18, and X have 27.5% of their 492 
Mbp in deserts (Table 11). The apparent lack of 
predicted genes in these regions does not nec- 
essarily imply that they are devoid of biological 
function. 
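
The desert definition used above (a region >500 kbp without a gene) can be illustrated with a short sketch. The interval representation (half-open `(start, end)` pairs in base pairs) and the function names are assumptions made for the example; the text does not describe how the authors implemented the scan.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) in base pairs


def find_deserts(genes: List[Interval], chrom_length: int,
                 min_size: int = 500_000) -> List[Interval]:
    """Return gene-free stretches longer than min_size on one chromosome."""
    deserts: List[Interval] = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)  # tolerate overlapping or nested genes
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts


def desert_fraction(deserts: List[Interval], chrom_length: int) -> float:
    """Fraction of the chromosome that lies in deserts."""
    return sum(end - start for start, end in deserts) / chrom_length
```

For example, three genes at 100-200 kbp, 900-950 kbp, and 1.0-1.1 Mbp on a 2-Mbp chromosome leave two deserts (200-900 kbp and 1.1-2.0 Mbp), or 80% of that toy chromosome.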

4.2 Linkage map 

Linkage maps provide the basis for genetic 
analysis and are widely used in the study of the 
inheritance of traits and in the positional clon- 
ing of genes. The distance metric, centimorgans 
(cM), is based on the recombination rate be- 
tween homologous chromosomes during meio- 
sis. In general, the rate of recombination in 
females is greater than that in males, and this 
degree of map expansion is not uniform across 
the genome (72). One of the opportunities en- 
abled by a nearly complete genome sequence is 
to produce the ultimate physical map, and to 
fully analyze its correspondence with two other 
maps that have been widely used in genome 
and genetic analysis: the linkage map and the 
cytogenetic map. This would close the loop 
between the mapping and sequencing phases of 
the genome project. 

Table 9. Characteristics of G+C in isochores. 

                        Fraction of genome       Fraction of genes
Isochore    G+C (%)     Predicted*   Observed    Predicted*   Observed
H3          >48          5            9.5        37           24.8
H1/H2       43-48       25           21.2        32           26.6
L           <43         67           69.2        31           48.5

*The predictions were based on Bernardi's definitions (70) of the isochore structure of the human genome. 

Fig. 9. Comparison of the number of exons per transcript between the 17,968 Otto transcripts and 21,350 de novo transcript predictions with at least one line of evidence that do not overlap with an Otto prediction. Both sets have the highest number of transcripts in the two-exon category, but the de novo gene predictions are skewed much more toward smaller transcripts. In the Otto set, 19.7% of the transcripts have one or two exons, and 5.7% have more than 20. In the de novo set, 49.3% of the transcripts have one or two exons, and 0.2% have more than 20. 

We mapped the location of the markers 
that constitute the Genethon linkage map to 
the genome. The rate of recombination, ex- 
pressed as cM per Mbp, was calculated for 
3-Mbp windows as shown in Table 12. High- 
er rates of recombination in the telomeric 
region of the chromosomes have been previ- 
ously documented (73). From this mapping 
result, there is a difference of 4.99 between 
lowest rates and highest rates and the largest 
difference of 4.4 between males and females 
(4.99 to 0.47 on chromosome 16). This indi- 
cates that the variability in recombination 
rates among regions of the genome exceeds 
the differences in recombination rates be- 
tween males and females. The human ge- 
nome has recombination hotspots, where re- 
combination rates vary fivefold or more over 
a space of 1 kbp, so the picture one gets of the 
magnitude of variability in recombination 
rate will depend on the size of the window 
examined. Unfortunately, too few meiotic 
crossovers have occurred in Centre d'Etude 
du Polymorphisme Humain (CEPH) and other 
reference families to provide a resolution any 
finer than about 3 Mbp. The next challenge 
will be to determine a sequence basis of 
recombination at the chromosomal level. An 
accurate predictor of variation in 
recombination rates between any pair of 
markers would be extremely useful in design- 
ing markers to narrow a region of linkage, 
such as in positional cloning projects. 
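
The recombination rate per physical distance (cM per Mbp in 3-Mbp windows) discussed in this section can be sketched as below. The marker representation (pairs of physical position in bp and genetic position in cM) and the use of linear interpolation between flanking markers at window boundaries are assumptions made for illustration; the text does not specify how rates were assigned to windows.

```python
from bisect import bisect_right
from typing import List, Tuple

Marker = Tuple[int, float]  # (physical position in bp, genetic position in cM)


def interp_cm(markers: List[Marker], pos: int) -> float:
    """Genetic position at pos by linear interpolation; clamped at the ends.

    markers must be sorted by physical position.
    """
    xs = [m[0] for m in markers]
    if pos <= xs[0]:
        return markers[0][1]
    if pos >= xs[-1]:
        return markers[-1][1]
    i = bisect_right(xs, pos)  # xs[i-1] <= pos < xs[i]
    (x0, y0), (x1, y1) = markers[i - 1], markers[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def windowed_rates(markers: List[Marker],
                   window_bp: int = 3_000_000) -> List[Tuple[int, int, float]]:
    """Return (start, end, cM per Mbp) for consecutive full windows."""
    markers = sorted(markers)
    first, last = markers[0][0], markers[-1][0]
    rates = []
    start = first
    while start + window_bp <= last:
        end = start + window_bp
        cm = interp_cm(markers, end) - interp_cm(markers, start)
        rates.append((start, end, cm / (window_bp / 1e6)))
        start = end
    return rates
```

Running this separately on male, female, and sex-averaged maps and taking the maximum, mean, and minimum per chromosome would give a table with the same shape as Table 12.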

4.3 Correlation between CpG islands 
and genes 

CpG islands are stretches of unmethylated 
DNA with a higher frequency of CpG 
dinucleotides when compared with the entire 
genome (74). CpG islands are believed to 
preferentially occur at the transcriptional start 
of genes, and it has been observed that most 
housekeeping genes have CpG islands at the 
5' end of the transcript (75, 76). In addition, 
experimental evidence indicates that CpG is- 
land methylation is correlated with gene in- 
activation (77) and has been shown to be 
important during gene imprinting (78) and 
tissue-specific gene expression (79). 

Experimental methods have been used 
that resulted in an estimate of 30,000 to 
45,000 CpG islands in the human genome 
(74, 80) and an estimate of 499 CpG islands 
on human chromosome 22 (81). Larsen et 
al. (76) and Gardiner-Garden and Frommer 
(75) used a computational method to iden- 
tify CpG islands and defined them as re- 
gions of DNA of >200 bp that have a G+C 
content of >50% and a ratio of observed 
versus expected frequency of CG dinucle- 
otide >0.6. 

It is difficult to make a direct compari- 
son of experimental definitions of CpG is- 
lands with computational definitions be- 
cause computational methods do not con- 
sider the methylation state of cytosine and 
experimental methods do not directly select 
regions of high G+C content. However, we 
can determine the correlation of CpG islands 
with gene starts, given a set of annotated 
genomic transcripts and the whole genome 
sequence. We have analyzed the publicly 
available annotation of chromosome 22, as 
well as using the entire human genome in 
our assembly and the computationally an- 
notated genes. A variation of the CpG is- 
land computation was compared with 
Larsen et al. (76). The main differences are 
that we use a sliding window of 200 bp, 
consecutive windows are merged only if 
they overlap, and we recompute the CpG 
value upon merging, thus rejecting any po- 
tential island if it scores less than the 
threshold. 
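
A minimal sketch of the sliding-window procedure just described is given below. It is not the authors' implementation. It assumes a 1-bp step between windows (the step size is not stated in the text) and uses the commonly cited observed/expected ratio, (number of CG dinucleotides × window length) / (number of C × number of G). All function names are hypothetical.

```python
from typing import List, Tuple


def cpg_stats(seq: str) -> Tuple[float, float]:
    """Return (G+C fraction, observed/expected CG ratio) for seq."""
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")
    gc_frac = (c + g) / n if n else 0.0
    ratio = (cg * n) / (c * g) if c and g else 0.0
    return gc_frac, ratio


def passes(seq: str, min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """True if seq meets both the G+C and the CG-ratio thresholds."""
    gc_frac, ratio = cpg_stats(seq)
    return gc_frac > min_gc and ratio > min_ratio


def find_cpg_islands(seq: str, window: int = 200, step: int = 1,
                     min_ratio: float = 0.6) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of candidate CpG islands."""
    seq = seq.upper()
    merged: List[List[int]] = []
    for start in range(0, len(seq) - window + 1, step):
        end = start + window
        if not passes(seq[start:end], min_ratio=min_ratio):
            continue
        if merged and start < merged[-1][1]:
            merged[-1][1] = end          # merge only with an overlapping window
        else:
            merged.append([start, end])
    # Recompute the score on each merged region and reject it if it falls short.
    return [(s, e) for s, e in merged if passes(seq[s:e], min_ratio=min_ratio)]
```

Calling `find_cpg_islands(seq, min_ratio=0.8)` corresponds to the stricter threshold the text calls method 2.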

To compute various CpG statistics, we 
used two different thresholds of CG dinucle- 
otide likelihood ratio. Besides using the orig- 
inal threshold of 0.6 (method 1), we used a 
higher threshold of CG dinucleotide likeli- 
hood ratio of 0.8 (method 2), which results in 
the number of CpG islands on chromosome 
22 close to the number of annotated genes on 
this chromosome. The main results are sum- 
marized in Table 13. CpG islands computed 
with method 1 predicted only 2.6% of the 
CSA sequence as CpG, but 40% of the gene 
starts (start codons) are contained inside a 
CpG island. This is comparable to ratios re- 
ported by others (82). The last two rows of 
the table show the observed and expected 
average distance, respectively, of the closest 
CpG island from the first exon. The observed 
average distances to the closest CpG island 
are smaller than the corresponding expected 
distances, confirming an association between 
CpG islands and the first exon. 

We also looked at the distribution of CpG 
island nucleotides among various sequence 
classes such as intergenic regions, introns, 
exons, and first exons. We computed the 
likelihood score for each sequence class as 
the ratio of the observed fraction of CpG 
island nucleotides in that sequence class 
to the expected fraction of CpG island 
nucleotides in that sequence class. The re- 
sults of applying method 1 on CSA were 
scores of 0.89 for intergenic region, 1.2 for 
intron, 5.86 for exon, and 13.2 for first 
exon. The same trend was also found for 
chromosome 22 and after the application of 
a higher threshold (method 2) on both data 
sets. In sum, genome-wide analysis has 
extended earlier analysis and suggests a 
strong correlation between CpG islands and 
first coding exons. 
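
The likelihood score defined above is an observed-to-expected ratio. The sketch below is a hedged illustration of one reading of that definition. It assumes the expected fraction for a class is simply that class's share of all nucleotides, and that the classes passed in are disjoint (in the text, first exons are a subset of exons, so they would need to be scored in a separate call). The function name and inputs are hypothetical.

```python
from typing import Dict


def enrichment_scores(class_bp: Dict[str, int],
                      island_bp_in_class: Dict[str, int]) -> Dict[str, float]:
    """Observed/expected fraction of CpG-island nucleotides per sequence class.

    class_bp:            total nucleotides in each (disjoint) sequence class
    island_bp_in_class:  CpG-island nucleotides falling in each class
    """
    total_bp = sum(class_bp.values())
    total_island = sum(island_bp_in_class.values())
    scores = {}
    for name, bp in class_bp.items():
        observed = island_bp_in_class[name] / total_island
        expected = bp / total_bp  # assumption: islands spread uniformly
        scores[name] = observed / expected
    return scores
```

A score above 1 indicates that CpG-island nucleotides are overrepresented in that class, which is the pattern reported for exons and first exons.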

4.4 Genome-wide repetitive elements 

The proportion of the genome covered by 
various classes of repetitive DNA is present- 
ed in Table 14. We observed about 35% of 
the genome in these repeat classes, very sim- 
ilar to values reported previously (83). Repet- 
itive sequence may be underrepresented in 
the Celera assembly as a result of incomplete 
repeat resolution, as discussed above. About 
8% of the scaffold length is in gaps, and we 
expect that much of this is repetitive se- 
quence. Chromosome 19 has the highest re- 
peat density (57%), as well as the highest 
gene density (Table 10). Of interest, among 
the different classes of repeat elements, we 
observe a clear association of Alu elements 
and gene density, which was not observed 
between LINEs and gene density. 

5 Genome Evolution 


Summary. The dynamic nature of genome 
evolution can be captured at several levels. 
These include gene duplications mediated by 
RNA intermediates (retrotransposition) and 
segmental genomic duplications. In this sec- 
tion, we document the genome-wide occur- 
rence of retrotransposition events generating 
functional (intronless paralogs) or inactive 
genes (pseudogenes). Genes involved in 
translational processes and nuclear regulation 
account for nearly 50% of all intronless para- 
logs and processed pseudogenes detected in 
our survey. We have also cataloged the extent 
of segmental genomic duplication and pro- 
vide evidence for 1077 duplicated blocks 
covering 3522 distinct genes. 

Fig. 11. Genome structural features. 

5.1 Retrotransposition in the human 
genome 

Retrotransposition of processed mRNA 
transcripts into the genome results in func- 
tional genes, called intronless paralogs, or 
inactivated genes (pseudogenes). A paralog 
refers to a gene that appears in more than 
one copy in a given organism as a result of 
a duplication event. The existence of both 
intron-containing and intronless forms of 
genes encoding functionally similar or 
identical proteins has been previously de- 
scribed (84, 85). Cataloging these evolu- 
tionary events on the genomic landscape is 
of value in understanding the functional 
consequences of such gene-duplication 
events in cellular biology. Identification of 
conserved intronless paralogs in the mouse 
or other mammalian genomes should pro- 
vide the basis for capturing the evolution- 
ary chronology of these transposition 
events and provide insights into gene loss 
and accretion in the mammalian radiation. 
A set of proteins corresponding to all 901 
Otto-predicted, single-exon genes was sub- 
jected to BLAST analysis against the proteins 
encoded by the remaining multiexon predict- 
ed transcripts. Using homology criteria of 
70% sequence identity over 90% of the 
length, we identified 298 instances of single- 
to multi-exon correspondence. Of these 298 
sequences, 97 were represented in the Gen- 
Bank data set of experimentally validated 
full-length genes at the stringency specified 
and were verified by manual inspection. 

We believe that these 97 cases may rep- 
resent intronless paralogs (see Web table 1 on 
Science Online at www.sciencemag.org/cgi/ 
content/full/291/5507/1304/DC1) of known 
genes. Most of these are flanked by direct 
repeat sequences, although the precise nature 
of these repeats remains to be determined. All 
of the cases for which we have high confi- 
dence contain polyadenylated [poly(A)] tails 
characteristic of retrotransposition. 

Recent publications describing the phe- 
nomenon of functional intronless paralogs 
speculate that retrotransposition may serve as 
a mechanism used to escape X-chromosomal 
inactivation (84, 86). We do not find a bias 
toward X chromosome origination of these 
retrotransposed genes; rather, the results 
show a random chromosome distribution of 
both the intron-containing and corresponding 
intronless paralogs. We also have found sev- 
eral cases of retrotransposition from a single 
source chromosome to multiple target chro- 
mosomes. Interesting examples include the 
retrotransposition of a five exon-containing 
ribosomal protein L21 gene on chromosome 
13 onto chromosomes 1, 3, 4, 7, 10, and 14, 
respectively. The size of the source genes can 
also show variability. The largest example is 
the 31-exon diacylglycerol kinase zeta gene 
on chromosome 11 that has an intronless 
paralog on chromosome 13. Regardless of 
route, retrotransposition with subsequent 
gene changes in coding or noncoding regions 
that lead to different functions or expression 
patterns represents a key route to providing 
an enhanced functional repertoire in mam- 
mals (87). 

Our preliminary set of retrotransposed in- 
tronless paralogs contains a clear overrepre- 
sentation of genes involved in translational 
processes (40% ribosomal proteins and 10% 
translation elongation factors) and nuclear 
regulation (HMG nonhistone proteins, 4%), 
as well as metabolic and regulatory enzymes. 
EST matches specific to a subset of intronless 
paralogs suggest expression of these intron- 
less paralogs. Differences in the upstream 
regulatory sequences between the source 
genes and their intronless paralogs could ac- 
count for differences in tissue-specific gene 
expression. Defining which, if any, of these 
processed genes are functionally expressed 
and translated will require further elucidation 
and experimental validation. 



5.2 Pseudogenes 

A pseudogene is a nonfunctional copy that is 
very similar to a normal gene but that has 
been altered slightly so that it is not ex- 
pressed. We developed a method for the pre- 
liminary analysis of processed pseudogenes 
in the human genome as a starting point in 
elucidating the ongoing evolutionary forces 

Table 11. Genome overview. 

Size of the genome (including gaps)                                   2.91 Gbp
Size of the genome (excluding gaps)                                   2.66 Gbp
Longest contig                                                        1.99 Mbp
Longest scaffold                                                      14.4 Mbp
Percent of A+T in the genome                                          54
Percent of G+C in the genome                                          38
Percent of undetermined bases in the genome                           9
Most GC-rich 50 kb                                                    Chr. 2 (66%)
Least GC-rich 50 kb                                                   Chr. X (25%)
Percent of genome classified as repeats                               35
Number of annotated genes                                             26,383
Percent of annotated genes with unknown function                      42
Number of genes (hypothetical and annotated)                          39,114
Percent of hypothetical and annotated genes with unknown function     59
Gene with the most exons                                              Titin (234 exons)
Average gene size                                                     27 kbp
Most gene-rich chromosome                                             Chr. 19 (23 genes/Mb)
Least gene-rich chromosomes                                           Chr. 13 (5 genes/Mb), Chr. Y (5 genes/Mb)
Total size of gene deserts (>500 kb with no annotated genes)          605 Mbp
Percent of base pairs spanned by genes                                25.5 to 37.8*
Percent of base pairs spanned by exons                                1.1 to 1.4*
Percent of base pairs spanned by introns                              24.4 to 36.4*
Percent of base pairs in intergenic DNA                               74.5 to 63.6*
Chromosome with highest proportion of DNA in annotated exons          Chr. 19 (9.33)
Chromosome with lowest proportion of DNA in annotated exons           Chr. Y (0.36)
Longest intergenic region (between annotated + hypothetical genes)    Chr. 13 (3,038,416 bp)
Rate of SNP variation                                                 1/1250 bp

*In these ranges, the percentages correspond to the annotated gene set (26,383 genes) and the hypothetical + 
annotated gene set (39,114 genes), respectively. 


Table 12. Rate of recombination per physical distance (cM/Mb) across the genome: Genethon markers 
were placed on CSA-mapped assemblies, and then relative physical distances and rates were calculated 
in 3-Mb windows for each chromosome. NA, not applicable. 



           Male                   Sex-average            Female
Chrom.     Max.   Avg.   Min.     Max.   Avg.   Min.     Max.   Avg.   Min.
1          2.60   1.12   0.23     2.81   1.42   0.52     3.39   1.76   0.68
2          2.23   0.78   0.33     2.65   1.12   0.54     3.17   1.40   0.61
3          2.55   0.86   0.23     2.40   1.07   0.42     2.71   1.30   0.33
4          1.66   0.67   0.15     2.06   1.04   0.60     2.50   1.40   0.77
5          2.00   0.67   0.18     1.87   1.08   0.42     2.26   1.43   0.62
6          1.97   0.71   0.28     2.57   1.12   0.37     3.47   1.67   0.64
7          2.34   1.16   0.48     1.67   1.17   0.47     2.27   1.21   0.34
8          V83    0.73   0.14     2.40   1.05   0.46     3.44   1.36   0.43
9          2.01   0.99   0.53     1.95   1.32   0.77     2.63   1.66   0.82
10         3.73   1.03   0.22     3.05   1.29   0.66     2.84   1.51   0.76
11         1.43   0.72   0.31     2.13   0.39   0.47     3.10   T32    0.49
12         4.12   0.76   0.26     3.35   1.16   0.49     2.93   1.55   0.59
13         1.60   0.75   0.01     1.87   0.95   0.17     2.49   1.19   0.32
14         3.1*   0.98   0.18     2.65   1.30   0.62     3.14   1.63   0.75
15         2.28   0.94   0.34     2.31   1.22   0.42     2.33   1.56   0.54
16         1.83   1.00   0.47     2.70   1.55   0.63     4.99   2.32   1.12
17         3.87   0.87   0.00     3.34   1.35   0.54     4.19   1.83   0.94
18         3.12   1.37   0.86     3.75   1.66   0.43     4.35   2.24   0.72
19         3.02   0.97   0.10     2.37   1.41   0.49     2.89   1.75   0.87
20         3.64   0.89   0.00     2.79   1.50   0.83     3.31   2.15   1.34
21         3.23   1.26   0.69     2.37   1.62   1.08     2.58   1.90   1.18
22         1.25   1.10   0.84     1.88   \A\    1.08     3.73   2.08   0.93
X          NA     NA     NA       NA     NA     NA       3.12   1.64   U72
Y          NA     NA     NA       NA     NA     NA       NA     NA     NA
Genome     4.12   0.88   0.00     3.75   1.22   0.17     4.99   1.55   0.32

that account for gene inactivation. The gen- 
eral structural characteristics of these pro- 
cessed pseudogenes include the complete 
lack of intervening sequences found in the 
functional counterparts, a poly(A) tract at the 
3' end, and direct repeats flanking the pseu- 
dogene sequence. Processed pseudogenes oc- 
cur as a result of retrotransposition, whereas 
unprocessed pseudogenes arise from segmen- 
tal genome duplication. 

We searched the complete set of Otto- 
predicted transcripts against the genomic se- 
quence by means of BLAST. Genomic re- 
gions corresponding to all Otto-predicted 
transcripts were excluded from this analysis. 
We identified 2909 regions matching with 
greater than 70% identity over at least 70% of 
the length of the transcripts that likely repre- 
sent processed pseudogenes. This number is 
probably an underestimate because specific 
methods to search for pseudogenes were not 
used. 

We looked for correlations between 
structural elements and the propensity for 
retrotransposition in the human genome. 
GC content and transcript length were com- 
pared between the genes with processed 
pseudogenes (1177 source genes) versus 
the remainder of the predicted gene set. 
Transcripts that give rise to processed pseu- 
dogenes have shorter average transcript 
length (1027 bp versus 1594 bp for the Otto 
set) as compared with genes for which no 
pseudogene was detected. The overall GC 
content did not show any significant differ- 
ence, contrary to a recent report (88). There 
is a clear trend in gene families that are 
present as processed pseudogenes. These 
include ribosomal proteins (67%), lamin 
receptors (10%), translation elongation fac- 
tor alpha (5%), and HMG-non-histone pro- 
teins (2%). The increased occurrence of 
retrotransposition (both intronless paralogs 
and processed pseudogenes) among genes 
involved in translation and nuclear regula- 
tion may reflect an increased transcription- 
al activity of these genes. 

5.3 Gene duplication in the human 
genome 

Building on a previously published procedure 
(27), we developed a graph-theoretic algo- 
rithm, called Lek, for grouping the predicted 
human protein set into protein families (89). 



Table 13. Characteristics of CpG islands identified in chromosome 22 (34-Mbp sequence length) and the 
whole genome (2.9-Gbp sequence length) by means of two different methods. Method 1 uses a CG 
likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of ≥0.8. 

                                                    Chromosome 22           Whole genome (CS assembly)
                                                    Method 1   Method 2     Method 1   Method 2
Number of CpG islands detected                      5,211      522          195,706    26,876
Average length of island (bp)                       390        535          395        497
Percent of sequence predicted as CpG                5.3        0.8          2.6        0.4
Percent of first exons that overlap a CpG island    44         25           42         22
Percent of first exons with first position of
  exon contained inside a CpG island                37         22           40         21
Average distance between first exon and
  closest CpG island (bp)                           1,013      10,486       2,182      17,021
Expected distance between first exon and
  closest CpG island (bp)                           3,262      32,567       7,164      55,811

Table 14. Distribution of repetitive DNA in the compartmentalized shotgun assembly sequence. 

                                            Megabases in    Percent     Previously
                                            assembled       of          predicted
Repetitive elements                         sequences       assembly    (%) (83)
Alu                                         288             9.9         10.0
Mammalian interspersed repeat (MIR)         66              2.3         1.7
Medium reiteration (MER)                    50              1.7         1.6
Long terminal repeat (LTR)                  155             5.3         5.6
Long interspersed nucleotide element (LINE) 466             16.1        16.7
Total                                       1025            35.3        35.6


The complete clusters that result from the 
Lek clustering provide one basis for compar- 
ing the role of whole-genome or chromosom- 
al duplication in protein family expansion as 
opposed to other means, such as tandem du- 
plication. Because each complete cluster rep- 
resents a closed and certain island of homol- 
ogy, and because Lek is capable of simulta- 
neously clustering protein complements of 
several organisms, the number of proteins 
contributed by each organism to a complete 
cluster can be predicted with confidence de- 
pending on the quality of the annotation of 
each genome. The variance of each organ- 
ism's contribution to each cluster can then be 
calculated, allowing an assessment of the rel- 
ative importance of large-scale duplication 
versus smaller-scale, organism-specific ex- 
pansion and contraction of protein families, 
presumably as a result of natural selection 
operating on individual protein families with- 
in an organism. As can be seen in Fig. 12, the 
large variance in the relative numbers of hu- 
man as compared with D. melanogaster and 
Caenorhabditis elegans proteins in complete 
clusters may be explained by multiple events 
of relative expansions in gene families in 
each of the three animal genomes. Such ex- 
pansions would give rise to the distribution 
that shows a peak at 1:1 in the ratio for 
human-worm or human-fly clusters with the 
slope spread covering both human and fly/ 
worm predominance, as we observed (Fig. 
12). Furthermore, there are nearly as many 
clusters where worm and fly proteins pre- 
dominate despite the larger numbers of pro- 
teins in the human. At face value, this anal- 
ysis suggests that natural selection acting on 
individual protein families has been a major 
force driving the expansion of at least some 
elements of the human protein set. However, 
in our analysis, the difference between an 
ancient whole-genome duplication followed 
by loss, versus piecemeal duplication, cannot 
be easily distinguished. In order to differen- 
tiate these scenarios, more extended analyses 
were performed. 

5.4 Large-scale duplications 
Using two independent methods, we 
searched for large-scale duplications in the 
human genome. First, we describe a protein 
family-based method that identified highly 
conserved blocks of duplication. We then 
describe our comprehensive method for identi- 
fying all interchromosomal block duplications. 
The latter method identified a large number of 
duplicated chromosomal segments covering 
parts of all 24 chromosomes. 

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de- 
termined to be in the same family and the 
same complete Lek cluster (essentially 
paralogous genes) (89). Initially, each chro- 
mosome was represented as a string of genes 
ordered by the start codons for predicted 
genes along the chromosome. We considered 
the two strands as a single string, because 
local inversions are relatively common events 
relative to large-scale duplications. Each 
gene was indexed according to the protein 
family and Lek complete cluster (89). All 
pairs of indexed gene strings were then 
aligned in both the forward and reverse di- 
rections with the Smith-Waterman algorithm 
(90). A match between two proteins of the 
same Lek complete cluster was given a score 
of 10 and a mismatch -10, with gap open 
and extend penalties of -4 and -1. With 
these parameters, 19 conserved interchromo- 
somal blocks of duplication were observed, 
all of which were also detected and expanded 
by the comprehensive method described be- 
low. The detection of only a relatively small 
number of block duplications was a conse- 
quence of using an intrinsically conservative 
method grounded in the conservative con- 
straints of the complete Lek clusters. 
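
The gene-string alignment described above can be sketched as a Smith-Waterman local alignment with affine gaps over sequences of cluster identifiers. This is an illustrative sketch rather than the authors' code. It uses the scores quoted in the text (match 10, mismatch -10, gap open -4, gap extend -1) and assumes that the first position of a gap costs the open penalty and each further position costs the extend penalty, a convention the text does not spell out. The function names and the use of integers as cluster identifiers are hypothetical.

```python
from typing import List, Sequence


def local_align_score(a: Sequence[int], b: Sequence[int],
                      match: int = 10, mismatch: int = -10,
                      gap_open: int = -4, gap_extend: int = -1) -> float:
    """Best Smith-Waterman (Gotoh, affine-gap) local alignment score."""
    n, m = len(a), len(b)
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]   # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]   # ending with a gap in a
    F = [[neg] * (m + 1) for _ in range(n + 1)]   # ending with a gap in b
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


def best_either_direction(a: Sequence[int], b: Sequence[int]) -> float:
    """Align b in forward and reverse order, as the text describes."""
    rev: List[int] = list(reversed(b))
    return max(local_align_score(a, b), local_align_score(a, rev))
```

Each chromosome would be encoded as the list of cluster identifiers of its genes in start-codon order, and every pair of chromosomes scored with `best_either_direction`.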

In the second, more comprehensive ap- 
proach, we aligned all chromosomes directly 
with one another using an algorithm based on 
the MUMmer system (91). This alignment 
method uses a suffix tree data structure and a 
linear-time algorithm to align long sequences 
very rapidly; for example, two chromosomes 
of 100 Mbp can be aligned in less than 20 
min (on a Compaq Alpha computer) with 4 
gigabytes of memory. This procedure was 
used recently to identify numerous large- 
scale segmental duplications among the five 
chromosomes of A. thaliana (92); in that 
organism, the method revealed that 60% of 
the genome (66 Mbp) is covered by 24 very 
large duplicated segments. For Arabidopsis, a 
DNA-based alignment was sufficient to re- 
veal the segmental duplications between 
chromosomes; in the human genome, DNA 
alignments at the whole-chromosome level 
are insufficiently sensitive. Therefore, a mod- 
ified procedure was developed and applied, 
as follows. First, all 26,588 proteins 
(9,675,713 amino acids) were concat- 
enated end-to-end in order as they occur 
along each of the 24 chromosomes, irrespec- 
tive of strand location. The concatenated pro- 
tein set was then aligned against each chro- 
mosome by the MUMmer algorithm. The 
resulting matches were clustered to extract all 
sets of three or more protein matches that 
occur in close proximity on two different 
chromosomes (93); these represent the can- 
didate segmental duplications. A series of 
filters were developed and applied to remove 
likely false-positives from this set; for exam- 
ple, small blocks that were spread across 
many proteins were removed. To refine the 
filtering methods, a shuffled protein set was 
first created by taking the 26,588 proteins, 
randomizing their order, and then partitioning 
them into 24 shuffled chromosomes, each 
containing the same number of proteins as the 
true genome. This shuffled protein set has the 
identical composition to the real genome; in 
particular, every protein and every domain 
appears the same number of times. The com- 
plete algorithm was then applied to both the 
real and the shuffled data, with the results on 
the shuffled data being used to estimate the 
false-positive rate. The algorithm after filter- 
ing yielded 10,310 gene pairs in 1077 dupli- 
cated blocks containing 3522 distinct genes; 
tandemly duplicated expansions in many of 
the blocks explain the excess of gene pairs to 
distinct genes. In the shuffled data, by con- 
trast, only 370 gene pairs were found, giving 
a false-positive estimate of 3.6%. The most 
likely explanation for the 1077 block dupli- 
cations is ancient segmental duplications. In 
many cases, the order of the proteins has been 
shuffled, although proximity is preserved. 
Out of the 1077 blocks, 159 contain only 
three genes, 137 contain four genes, and 781 
contain five or more genes. 
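
The shuffled-genome control and the resulting false-positive estimate can be sketched as follows. The data layout (one list of protein identifiers per chromosome) and the function names are assumptions for illustration; the duplication-detection step itself is not reproduced here.

```python
import random
from typing import List, Optional


def shuffle_into_chromosomes(proteins_by_chrom: List[List[str]],
                             seed: Optional[int] = None) -> List[List[str]]:
    """Randomize protein order genome-wide, then re-partition so that each
    shuffled chromosome holds the same number of proteins as the real one."""
    rng = random.Random(seed)
    pool = [p for chrom in proteins_by_chrom for p in chrom]
    rng.shuffle(pool)
    shuffled, i = [], 0
    for chrom in proteins_by_chrom:
        shuffled.append(pool[i:i + len(chrom)])
        i += len(chrom)
    return shuffled


def false_positive_estimate(pairs_real: int, pairs_shuffled: int) -> float:
    """Gene pairs found in shuffled data as a fraction of those in real data."""
    return pairs_shuffled / pairs_real
```

With the counts reported in the text, `false_positive_estimate(10310, 370)` is about 0.036, matching the 3.6% quoted above.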

To illustrate the extent of the detected 
duplications, Fig. 13 shows all 1077 block 
duplications indexed to each chromosome in 
24 panels in which only duplications mapped 
to the indexed chromosome are displayed. 
The figure makes it clear that the duplications 
are ubiquitous in the genome. One feature 
that it displays is many relatively small chro- 
mosomal stretches, with one-to-many dupli- 
cation relationships that are graphically strik- 
ing. One such example captured by the anal- 
ysis is the well-documented olfactory recep- 
tor (OR) family, which is scattered in blocks 
throughout the genome and which has been 
analyzed for genome-deployment reconstruc- 
tions at several evolutionary stages (94). The 
figure also illustrates that some chromo- 
somes, such as chromosome 2, contain many 
more detected large-scale duplications than 
others. Indeed, one of the largest duplicated 
segments is a large block of 33 proteins on 
chromosome 2, spread among eight smaller 
blocks in 2p, that aligns to a paralogous set on 
chromosome 14, with one rearrangement (see 
chromosomes 2 and 14 panels in Fig. 13). 
The proteins are not contiguous but span a 
region containing 97 proteins on chromo- 
some 2 and 332 proteins on chromosome 14. 
The likelihood of observing this many dupli- 
cated proteins by chance, even over a span of 
this length, is 2.3 × 10^-68 (93). This dupli- 
cated set spans 20 Mbp on chromosome 2 and 
63 Mbp on chromosome 14, over 70% of the 
latter chromosome. Chromosome 2 also con- 
tains a block duplication that is nearly as 
large, which is shared by chromosome arm 2q 
and chromosome 12. This duplication incor- 
porates two of the four known Hox gene 
clusters, but considerably expands the extent 
of the duplications proximally and distally on 
the pair of chromosome arms. This breadth of 
duplication is also seen on the two chromo- 
somes carrying the other two Hox clusters. 

An additional large duplication, between 
chromosomes .18 and 20/ serves as a good 
example to illustrate some of the features 
common to many of the other observed large 
duplications (Fig. 13, inset).* This duplication 
contains 64 detected ordered mtrachLromo- 
-somal pairs of homologous genes. After dis- 
counting a 40-Mb stretch of chromosome 18 
*: free of matches to chromosome 20, which is 
likely to represent a large insert (between the 
gene assignments "Krup rel" and "collagen 
rel" on chromosome 18 in Fig. 13), the full 
duplication segment covers 36 Mb on chro- 
mosome 18 and 28 Mb on chromosome 20. 
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fig. 12. Gene duplication In complete protein clusters. The predicted protein sets of human, worm, 
and fly were subjected to Lek clustering (27). The numbers of clusters with varying ratios (whole 
number) of human versus worm and human versus fly proteins per cluster were plotted. 
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By this measure, the duplication segment 
spans nearly half of each chromosome's net
length. The most likely scenario is that the 
whole span of this region was duplicated as a 
single very large block, followed by shuffling 
owing to smaller scale rearrangements. As 
such, at least four subsequent rearrangements 
would need to be invoked to explain the 
relative insertions and inversions seen in the 
duplicated segment interval. The 64 protein 
pairs in this alignment occur among 217 pro- 
tein assignments on chromosome 18, and 
among 322 protein assignments on chromosome 20,
for a density of involved proteins of
20 to 30%. This is consistent with an ancient
large-scale duplication followed by subse- 
quent gene loss on one or both chromosomes. 
Loss of just one member of a gene pair 
subsequent to the duplication would result in 
a failure to score a gene pair in the block; less 
than 50% gene loss on the chromosomes 
would lead to the duplication density observed
here. As an independent verification
of the significance of the alignments detected,
it can be seen that a substantial number of
the pairs of aligning proteins in this duplication,
including some of those annotated (Fig.
13), are those populating small Lek complete
clusters (see above). This indicates that they
are members of very small families of paralogs;
their relative scarcity within the genome
validates the uniqueness and robust nature of
their alignments.
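
The 20 to 30% density quoted above can be checked directly from the counts given in the text. The following is a minimal sketch of that arithmetic; the function and constant names are illustrative and do not come from the original analysis.

```python
def pair_density(n_pairs: int, n_assignments: int) -> float:
    """Fraction of protein assignments on a chromosome that belong to a scored pair."""
    return n_pairs / n_assignments


# Counts quoted in the text for the chromosome 18 / chromosome 20 duplication.
PAIRS = 64
CHR18_ASSIGNMENTS = 217
CHR20_ASSIGNMENTS = 322

density_18 = pair_density(PAIRS, CHR18_ASSIGNMENTS)
density_20 = pair_density(PAIRS, CHR20_ASSIGNMENTS)

print(f"chromosome 18: {density_18:.1%}")
print(f"chromosome 20: {density_20:.1%}")
```

The two values bracket the quoted range: the shorter list of assignments (chromosome 18) gives the upper end and the longer list (chromosome 20) the lower end.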

Two additional qualitative features were observed
among many of the large-scale duplications.
First, several proteins with disease associations,
with OMIM (Online Mendelian Inheritance
in Man) assignments, are members of
duplicated segments (see web table 2 on Science
Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1).
We have also
observed a few instances where paralogs on
both duplicated segments are associated with
similar disease conditions. Notable among
these genes are proteins involved in hemostasis
(coagulation factors) that are associated with
bleeding disorders, transcriptional regulators
like the homeobox proteins associated with developmental
disorders, and potassium channels
associated with cardiovascular conduction abnormalities.
For each of these disease genes,
closer study of the paralogous genes in the
duplicated segment may reveal new insights
into disease causation, with further investigation
needed to determine whether they might be
involved in the same or similar genetic diseases.
Second, although there is a conserved number
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This se- 
lective accretion of noncoding DNA (or con- 
versely, loss of noncoding DNA) on one of a 
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■pair of duplicated chromosome regions was 
observed in many compared regions. Hypothe- 
ses to explain which mechanisms foster these 
processes must be tested. 

Evaluation of the alignment results gives 
some perspective on dating of the duplications. 
As noted above, large-scale ancient segmental 
duplication in fact best explains many of the
blocks detected by this genome-wide analysis.
The regions of human chromosomes involved
in the large-scale duplications expanded upon
above (chromosomes 2 to 14, 2 to 12, and 18 to
20) are each syntenic to a distinct mouse chromosomal
region. The corresponding mouse
chromosomal regions are much more similar in
sequence conservation, and even in order, to
their human synteny partners than the human
duplication regions are to each other. Further,
the corresponding mouse chromosomal regions
each bear a significant proportion of genes orthologous
to the human genes on which the
human duplication assignments were made. On
the basis of these factors, the corresponding
mouse chromosomal spans, at coarse resolution,
appear to be products of the same large-scale
duplications observed in humans. Although
further detailed analysis must be carried
out once a more complete genome is assembled
for mouse, the underlying large duplications
appear to predate the two species' divergence.
This dates the duplications, at the latest, before
divergence of the primate and rodent lineages.
This date can be further refined upon examination
of the synteny between human chromosomes
and those of chicken, pufferfish (Fugu
rubripes), or zebrafish (95). The only substantial
syntenic stretches mapped in these
species corresponding to both pairs of human
duplications are restricted to the Hox cluster
regions. When the synteny of these regions
(or others) to human chromosomes is extended
with further mapping, the ages of the
nearly chromosome-length duplications seen
in humans are likely to be dated to the root of
vertebrate divergence.

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplications
raises the question of whether an ancient
whole-genome duplication event is the underlying
explanation for the numerous duplicated
regions (96). The duplications have undergone 
many deletions and subsequent rearrangements; 
these events make it difficult to distinguish 
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually reveal
the stagewise history of our genome, and
with it a history of the emergence of many of
the key functions that distinguish us from other
living things.

6 A Genome-Wide Examination of 
Sequence Variations 

Summary. Computational methods were used 
to identify single-nucleotide polymorphisms
(SNPs) by comparison of the Celera sequence
to other SNP resources. The SNP rate between
two chromosomes was ~1 per 1200 to
1500 bp. SNPs are distributed nonrandomly
throughout the genome. Only a very small
proportion of all SNPs (<1%) potentially
impact protein function based on the functional
analysis of SNPs that affect the predicted
coding regions. This results in an es-
timate that only thousands, not millions, of 
genetic variations may contribute to the struc- 
tural diversity of human proteins. 

Having a complete genome sequence enables 
researchers to achieve a dramatic acceleration 
in the rate of gene discovery, but only through 
analysis of sequence variation in DNA can we
discover the genetic basis for variation in health
among human beings. Whole-genome shotgun
sequencing is a particularly effective method
for detecting sequence variation in tandem with
whole-genome assembly. In addition, we compared
the distribution and attributes of SNPs
ascertained by three other methods: (i) alignment
of the Celera consensus sequence to the
PFP assembly, (ii) overlap of high-quality reads
of genomic sequence (referred to as "Kwok";
1,120,195 SNPs) (97), and (iii) reduced representation
shotgun sequencing (referred to as
"TSC"; 632,640 SNPs) (98). These data were
consistent in showing an overall nucleotide diversity
of ~8 × 10^-4, marked heterogeneity
across the genome in SNP density, and an
overwhelming preponderance of noncoding
variation that produces no change in expressed 
proteins. 
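
The SNP rate quoted in the summary (about one SNP per 1200 to 1500 bp between two chromosomes) and the nucleotide diversity quoted here (about 8 × 10^-4) are two ways of stating the same quantity. A minimal sketch of the conversion, assuming the diversity between a single pair of chromosomes is simply the reciprocal of the mean spacing between differences; the function names are illustrative.

```python
def diversity_from_spacing(bp_per_snp: float) -> float:
    """Per-site difference rate between two chromosomes, given mean SNP spacing in bp."""
    return 1.0 / bp_per_snp


def spacing_from_diversity(pi: float) -> float:
    """Mean number of base pairs between differences for a given per-site diversity."""
    return 1.0 / pi


for spacing in (1200, 1500):
    print(f"1 SNP per {spacing} bp -> diversity {diversity_from_spacing(spacing):.2e}")

print(f"diversity 8e-4 -> 1 SNP per {spacing_from_diversity(8e-4):.0f} bp")
```

Under this reading, a diversity of 8 × 10^-4 falls inside the quoted 1200 to 1500 bp spacing range.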

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 
Ideally, methods of SNP discovery make full
use of sequence depth and quality at every site,
and quantitatively control the rate of false-positive
and false-negative calls with an explicit
sampling model (99). Comparison of consensus
sequences in the absence of these details necessitated
a more ad hoc approach (quality scores
could not readily be obtained for the PFP assembly).
First, all sequence mismatches between
the two consensus sequences were identified;
these were then filtered to reduce the contribution
of sequencing errors and misassembly. As
a measure of the effectiveness of the filtering
step, we monitored the ratio of transition to
transversion substitutions, because a 2:1 ratio
has been well documented as typical in mammalian
evolution (100) and in human SNPs
(101, 102). The filtering steps consisted of removing
variants where the quality score in the
Celera consensus was less than 30 and where
the density of variants was greater than 5 in 400
bp. These filters resulted in shifting the transition-to-transversion
ratio from ~1.57:1 to
1.89:1. When applied to 2.3 Gbp of alignments
between the Celera and PFP consensus sequences,
these filters resulted in identification
of 2,104,820 putative SNPs from a total of
2,778,474 substitution differences. Overlaps
between this set of SNPs and those found by
other methods are described below.
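
The two filters and the transition-to-transversion check described above can be sketched as follows. This is a hedged illustration rather than the pipeline actually used: the `Variant` record, the function names, the choice of a window centred on each variant, and the toy data are all assumptions, since the text gives only the thresholds (quality below 30, more than 5 variants in 400 bp).

```python
from bisect import bisect_left
from dataclasses import dataclass

# Unordered base pairs that count as transitions; every other change is a transversion.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


@dataclass(frozen=True)
class Variant:
    pos: int      # position of the mismatch in the alignment
    ref: str      # base in the first consensus sequence
    alt: str      # base in the second consensus sequence
    quality: int  # quality score of the base in the first consensus


def ts_tv_ratio(variants):
    """Ratio of transitions (A<->G, C<->T) to transversions."""
    ts = sum(frozenset((v.ref, v.alt)) in TRANSITIONS for v in variants)
    tv = len(variants) - ts
    return ts / tv if tv else float("inf")


def filter_variants(variants, min_quality=30, max_per_window=5, window=400):
    """Keep variants with adequate quality that do not sit in a dense cluster."""
    ordered = sorted(variants, key=lambda v: v.pos)
    positions = [v.pos for v in ordered]
    kept = []
    for v in ordered:
        # Count all candidate variants in the half-open window
        # [pos - window/2, pos + window/2); the centring is an assumption.
        lo = bisect_left(positions, v.pos - window // 2)
        hi = bisect_left(positions, v.pos + window // 2)
        if v.quality >= min_quality and (hi - lo) <= max_per_window:
            kept.append(v)
    return kept


# Hypothetical candidates: three isolated calls, one low-quality call,
# and a cluster of six transversions within 50 bp.
candidates = [
    Variant(100, "A", "G", 40),
    Variant(1000, "C", "T", 45),
    Variant(2000, "A", "C", 50),
    Variant(3000, "G", "A", 20),
] + [Variant(5000 + 10 * i, "A", "T", 40) for i in range(6)]

kept = filter_variants(candidates)
print(f"kept {len(kept)} of {len(candidates)} candidates")
print(f"Ts/Tv before: {ts_tv_ratio(candidates):.2f}, after: {ts_tv_ratio(kept):.2f}")
```

In the toy data the clustered mismatches are mostly transversions, so removing them raises the ratio, which is the direction of change the text reports for the real data.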

6.2 Comparisons to public SNP 
databases 

Additional SNPs, including 2,536,021 from
dbSNP (www.ncbi.nlm.nih.gov/SNP) and
13,150 from HGMD (Human Gene Mutation
Database, from the University of
Wales, UK), were mapped on the Celera consensus
sequence by a sequence similarity
search with the program PowerBlast (103). The
two largest data sets in dbSNP are the Kwok
and TSC sets, with 47% and 25% of the dbSNP
records. Low-quality alignments with partial
coverage of the dbSNP sequence and alignments
that had less than 98% sequence identity
between the Celera sequence and the dbSNP
flanking sequence were eliminated. dbSNP sequences
mapping to multiple locations on the
Celera genome were discarded. A total of
2,336,935 dbSNP variants were mapped to
1,223,033 unique locations on the Celera sequence,
implying considerable redundancy in
dbSNP. SNPs in the TSC set mapped to
585,811 unique genomic locations, and SNPs in
the Kwok set mapped to 438,032 unique locations.
The combined unique SNP count used
in this analysis, including Celera-PFP, TSC,
and Kwok, is 2,737,668. Table 15 shows that a
substantial fraction of SNPs identified by one of
these methods was also found by another method.
The very high overlap (36.2%) between the
Kwok and Celera-PFP SNPs may be due in part
to the use by Kwok of sequences that went into
the PFP assembly. The unusually low overlap
(16.4%) between the Kwok and TSC sets is due
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to their being the smallest two sets. In addition 
24.5% of the Celera-PFP SNPs overlap with 
SNPs derived from the Celera genome se- 
quences (46). SNP validation in population 
samples is an expensive and laborious process, 
so confirmation on multiple data sets may provide
an efficient initial validation "in silico" (by
computational analysis). 
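
A sketch of the kind of overlap and redundancy bookkeeping described in this section. It assumes each SNP has been reduced to a (chromosome, position) key on a common coordinate system and that the overlap fraction is taken relative to the smaller of the two sets; that denominator, the function name, and the toy sets are assumptions made for illustration.

```python
def overlap_fraction(set_a: set, set_b: set) -> float:
    """Shared SNP locations divided by the size of the smaller set (assumed denominator)."""
    shared = len(set_a & set_b)
    return shared / min(len(set_a), len(set_b))


# Hypothetical mapped locations for two small SNP sets.
set_one = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900)}
set_two = {("chr1", 100), ("chr2", 40), ("chr3", 7), ("chr3", 8), ("chr4", 1)}
print(f"overlap fraction: {overlap_fraction(set_one, set_two):.2f}")

# Redundancy implied by the counts quoted in the text:
# 2,336,935 mapped dbSNP variants at 1,223,033 unique locations.
records_per_location = 2_336_935 / 1_223_033
print(f"dbSNP records per unique location: {records_per_location:.2f}")
```

Storing locations as set members makes the redundancy visible automatically: duplicate records collapse to one key, so the ratio of records to unique keys measures how often the same site was submitted more than once.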

One means of assessing whether the 
three sets of SNPs provide the same picture 
of human variation is to tally the frequencies
of the six possible base changes in
each set of SNPs (Table 16). Previous measures
of nucleotide diversity were mostly
derived from small-scale analysis on candidate
genes (101), and our analysis with
all three data sets validates the previous
observations at the whole-genome scale.



site. These data are not readily available, so 
we could not estimate nucleotide diversity 
from the TSC effort. Estimation of nucleotide
diversity from high-quality sequence
overlaps should be possible, but again, 
more information is needed on the details 
of all the alignments. 

Estimation of nucleotide diversity from a 
shotgun assembly entails calculating for each
column of the multialignment the probability
that two or more distinct alleles are present
and the probability of detecting a SNP if in
fact the alleles have different sequence (i.e.,
the probability of correct sequence calls). The 
greater the depth of coverage and the higher 
the sequence quality, the higher is the chance 
of successfully detecting a SNP (105). Even 
after correcting for variation in coverage, the 
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th A w 0ve *P; o ''SNPj -from genome-wi'de 
SNP databases. Table entries are SNP counts for
each pair of data sets. Numbers in parentheses are
the fraction of overlap, calculated as the count of
overlapping SNPs divided by the number of SNPs
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•the SNPs found in the Kwok set, the TSC 
set, and in our whole-genome shotgun (46) 
in this substitution pattern. Compared with
the rest of the data sets, Celera-PFP deviates
slightly from the 2:1 transition-to-transversion
ratio observed in the other
SNP sets. This result is not unexpected,
because some fraction of the computationally
identified SNPs in the Celera-PFP
comparison may in fact be sequence errors.
A 2:1 transition:transversion ratio for the
bona fide SNPs would be obtained if one
assumed that 15% of the sequence differences
in the Celera-PFP set were a result of
(presumably random) sequence errors.


6.3 Estimation of nucleotide diversity 
from ascertained SNPs 

The number of SNPs identified varied 
widely across chromosomes. In order to
normalize these values to the chromosome
size and sequence coverage, we used π, the
standard statistic for nucleotide diversity
(104). Nucleotide diversity is a measure of
per-site heterozygosity, quantifying the
probability that a pair of chromosomes
drawn from the population will differ at a
nucleotide site. In order to calculate nucleotide
diversity for each chromosome, we
need to know the number of nucleotide
sites that were surveyed for variation, and
in methods like reduced representation sequencing,
we need to know the sequence
quality and the depth of coverage at each



autosomes. The significance of this heteroge- 
neity was tested by analysis of variance, with 
estimates of π for 100-kbp windows to estimate
variability within chromosomes (for the
Celera-PFP comparison, F = 29.73, P <
0.0001).

Average diversity for the autosomes estimated
from the Celera-PFP comparison
was 8.94 × 10^-4. Nucleotide diversity on
the X chromosome was 6.54 × 10^-4. The
X is expected to be less variable than autosomes,
because for every four copies of
autosomes in the population, there are only
three X chromosomes, and this smaller effective
population size means that random
drift will more rapidly remove variation
from the X (106).
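
The three-to-four argument above gives a simple numerical expectation that can be set against the two reported diversities. The sketch below assumes equal numbers of males and females and equal mutation rates on the X and the autosomes, so that diversity scales with the number of chromosome copies; those assumptions and the variable names are illustrative.

```python
# Diversities reported in the text for the Celera-PFP comparison.
AUTOSOME_DIVERSITY = 8.94e-4
X_DIVERSITY = 6.54e-4

# Copies per mating pair: the female carries two X chromosomes and the male one,
# while both carry two copies of every autosome.
X_COPIES = 2 + 1
AUTOSOME_COPIES = 2 + 2

expected_ratio = X_COPIES / AUTOSOME_COPIES
observed_ratio = X_DIVERSITY / AUTOSOME_DIVERSITY

print(f"expected X/autosome diversity ratio: {expected_ratio:.2f}")
print(f"observed X/autosome diversity ratio: {observed_ratio:.2f}")
```

The observed ratio lies close to the simple expectation of 0.75, which is consistent with the explanation given in the text.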

Having ascertained nucleotide variation
genome-wide, it appears that previous estimates
of nucleotide diversity in humans
based on samples of genes were reasonably
accurate (101, 102, 106, 107). Genome-wide,
our estimate of nucleotide diversity was
8.98 × 10^-4 for the Celera-PFP alignment,
and a published estimate averaged over 10
densely resequenced human genes was
8.00 × 10^-4 (108).



6.4 Variation in nucleotide diversity 
across the human genome 

Such an apparently high degree of variability
among chromosomes in SNP density
raises the question of whether there is heterogeneity
at a finer scale within chromo-
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. Tabte 16. Summary of nucleotide changes In different SNP data sets. 
' SNP data set 
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Fig. 13. Segmental duplica- 
tions between chromo- 
somes in the human genome.
The 24 panels show
the 1077 duplicated blocks
of genes, containing 10,310
pairs of genes in total. Each
line represents a pair of homologous
genes belonging
to a block; all blocks contain
at least three genes
on each of the chromosomes
where they appear.
Each panel shows all the
duplications between a
single chromosome and
other chromosomes with
shared blocks. The chromosome
at the center of
each panel is shown as a
thick red line for emphasis.
Other chromosomes are
displayed from top to bottom
within each panel ordered
by chromosome
number. The inset (bottom,
center right) shows a
close-up of one duplication
between chromosomes
18 and 20, expanded
to display the gene
names of 12 of the 64
gene pairs shown.
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somes, and whether this heterogeneity is 
greater than expected by chance. If SNPs
occur by random and independent mutations, 
then it would seem that there ought to be a 
Poisson distribution of numbers of SNPs in 
fragments of arbitrary constant size. The ob- 
served dispersion in the distribution of SNPs 
in 100-kbp fragments was far greater than 
predicted from a Poisson distribution (Fig. 
14). However, this simplistic model ignores
the different recombination rates and population
histories that exist in different regions of
the genome. Population genetics theory holds
that we can account for this variation with a
mathematical formulation called the neutral
coalescent (109). Applying well-tested algorithms
for simulating the neutral coalescent
with recombination (110), and using an effective
population size of 10,000 and a per-base
recombination rate equal to the mutation
rate (111), we generated a distribution of numbers
of SNPs by this model as well (112). The
observed distribution of SNPs has a much larger
variance than either the Poisson model or the
coalescent model, and the difference is highly
significant. This implies that there is significant
variability across the genome in SNP density,
an observation that begs an explanation.
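
One simple way to see the kind of excess variance described above is the index of dispersion (variance divided by mean) of per-window SNP counts, which is close to 1 when counts follow a Poisson distribution. This is a hedged sketch of that check only, not the test used in the original analysis; the window counts below are invented for illustration.

```python
from statistics import mean, pvariance


def index_of_dispersion(counts):
    """Variance-to-mean ratio of window counts; about 1 under a Poisson model."""
    m = mean(counts)
    return pvariance(counts, m) / m


# Hypothetical SNP counts in five 100-kbp windows with the same mean (80 per window).
clustered_windows = [20, 150, 60, 10, 160]

ratio = index_of_dispersion(clustered_windows)
print(f"mean count: {mean(clustered_windows)}")
print(f"index of dispersion: {ratio:.1f}")
```

A ratio far above 1, as in this toy example, indicates that SNPs are clumped in some windows and sparse in others, which is the pattern the text reports relative to both the Poisson and coalescent expectations.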

Several attributes of the DNA sequence
may affect the local density of SNPs, including
the rate at which DNA polymerase
makes errors and the efficacy of mismatch
repair. One key factor that is likely to be
associated with SNP density is the G+C
content, in part because methylated cytosines
in CpG dinucleotides tend to undergo
deamination to form thymine, accounting
for a nearly 10-fold increase in the
mutation rate of CpGs over other dinucleotides.
We tallied the GC content and nucleotide
diversities in 100-kbp windows
across the entire genome and found that the
correlation between them was positive (r =
0.21) and highly significant (P < 0.0001),
but G+C content accounted for only a
small part of the variation.
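
The statement that G+C content explains only a small part of the variation follows from squaring the correlation coefficient. The sketch below shows that arithmetic together with illustrative helpers for per-window G+C content and a Pearson correlation; the helper names and the handling of ambiguous bases are assumptions, not details from the original analysis.

```python
import math


def gc_content(window: str) -> float:
    """Fraction of G or C among the unambiguous bases (A, C, G, T) in a window."""
    window = window.upper()
    gc = window.count("G") + window.count("C")
    called = gc + window.count("A") + window.count("T")
    return gc / called if called else 0.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Reported correlation between G+C content and nucleotide diversity.
REPORTED_R = 0.21
variance_explained = REPORTED_R ** 2

print(f"G+C content of a toy window: {gc_content('GGCCAATTNN'):.2f}")
print(f"fraction of variance explained at r = 0.21: {variance_explained:.3f}")
```

With r = 0.21, G+C content accounts for roughly 4% of the variance in diversity among windows, so the correlation can be highly significant across many windows while still leaving most of the heterogeneity unexplained.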

6.5 SNPs by genomic class
To test homogeneity of SNP densities 
across functional classes, we partitioned 
sites into intergenic (defined as >5 kbp 
from any predicted transcription unit), 5'-UTR,
exonic (missense and silent), intronic,
and 3'-UTR for 10,239 known
genes, derived from the NCBI RefSeq database
and all human genes predicted from
the Celera Otto annotation. In coding re- 
gions, SNPs were categorized as either si- 
lent, for those that do not change amino 
acid sequence, or missense, for those that 
change the protein product. The ratio of 
missense to silent coding SNPs in Celera- 
PFP, TSC, and Kwok sets (1.12, 0.91, and 
0.78, respectively) shows a markedly reduced
frequency of missense variants compared
with the neutral expectation, consistent
with the elimination by natural selection
of a fraction of the deleterious amino
acid changes (112). These ratios are comparable
to the missense-to-silent ratios of
0.88 and 1.17 found by Cargill et al. (101)
and by Halushka et al. (102). Similar results
were observed in SNPs derived from
Celera shotgun sequences (46).
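
The silent versus missense distinction used above can be illustrated with the standard genetic code. This is a minimal sketch under stated assumptions: the SNP is given as a reference codon, an offset within the codon, and a replacement base, and any change to the encoded residue (including a change to or from a stop codon) is counted as missense because the text uses only two categories. Function names and the example calls are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, offset: int, alt_base: str) -> str:
    """Return 'silent' if the encoded residue is unchanged, otherwise 'missense'."""
    codon = codon.upper()
    mutated = codon[:offset] + alt_base.upper() + codon[offset + 1:]
    return "silent" if CODON_TABLE[codon] == CODON_TABLE[mutated] else "missense"


def missense_to_silent_ratio(labels) -> float:
    """Ratio of missense to silent calls in a list of classifications."""
    labels = list(labels)
    return labels.count("missense") / labels.count("silent")


calls = [
    classify_coding_snp("GAA", 2, "G"),  # GAA -> GAG, Glu to Glu
    classify_coding_snp("GAA", 0, "A"),  # GAA -> AAA, Glu to Lys
    classify_coding_snp("CTT", 2, "C"),  # CTT -> CTC, Leu to Leu
    classify_coding_snp("TGG", 1, "C"),  # TGG -> TCG, Trp to Ser
    classify_coding_snp("ATG", 2, "A"),  # ATG -> ATA, Met to Ile
]
print(calls)
print(f"missense:silent = {missense_to_silent_ratio(calls):.2f}")
```

Because most third-position changes are silent while nearly all first- and second-position changes alter the residue, random mutation alone would produce more missense than silent changes, which is why ratios near or below 1 point to selection against amino acid replacements.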

It is striking how small is the fraction of 
SNPs that lead to potentially dysfunctional 
alterations in proteins. In the 10,239 RefSeq
genes, missense SNPs were only about
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Fig. 14. SNP density In each 100-kbp Interval as determined with Cetera-PFP SNPs. The color codes 
are as follows: black, Celera-PFP SNP density; blue, coalescent model; and red, Poisson distribution. 
The figure shows that the distribution of SNPs along the genome is nonrandom and is not entirely
accounted for by a coalescent model of regional history. 



0.12, 0.14, and 0.17% of the total SNP 
counts in Celera-PFP, TSC, and Kwok 
SNPs, respectively. Nonconservative pro- 
tein changes constitute an even smaller frac- 
tion of missense SNPs (47, 41, and 40% in 
Celera-PFP, Kwok, and TSC). Intergenic regions
have been virtually unstudied (113), and
we note that 75% of the SNPs we identified
were intergenic (Table 17). The SNP rate was
highest in introns and lowest in exons. The SNP
rate was lower in intergenic regions than in
introns, providing one of the first discriminators
between these two classes of DNA. These SNP
rates were confirmed in the Celera SNPs, which
also exhibited a lower rate in exons than in
introns, and in extragenic regions than in introns
(46). Many of these intergenic SNPs will
provide valuable information in the form of
markers for linkage and association studies, and
some fraction is likely to have a regulatory
function as well.

7 An Overview of the Predicted 
Protein-Coding Genes in the Human 
Genome 

Summary. This section provides an initial 
computational analysis of the predicted 
protein set with the aim of cataloging 

prominent differences and similarities
when the human genome is compared with
other fully sequenced eukaryotic genomes.
Over 40% of the predicted protein set in
humans cannot be ascribed a molecular
function by methods that assign proteins to
known families. A protein domain-based
analysis provides a detailed catalog of the
prominent differences in the human genome
when compared with the fly and
worm genomes. Prominent among these are
domain expansions in proteins involved in 
developmental regulation and in cellular 
processes such as neuronal function, hemo- 
stasis, acquired immune response, and cy- 
toskeletal complexity. The final enumera- 
tion of protein families and details of pro- 
tein structure will rely on additional exper- 
imental work and comprehensive manual 
curation. 

A preliminary analysis of the predicted hu- 
man protein-coding genes was conducted. 
Two methods were used to analyze and clas- 
sify the molecular functions of 26,588 pre- 
dicted proteins that represent 26,383 gene 
predictions with at least two lines of evidence 
as described above. The first method was 
based on an analysis at the level of protein 
families, with both the publicly available 
Pfam database (114, 115) and Celera's Pan- 
ther Classification (CPC) (Fig. 15) (116). 
The second method was based on an analysis 
at the level of protein domains, with both the 
Pfam and SMART databases (115, 117). 

The results presented here are preliminary
and are subject to several limitations.
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Both the gene predictions and functional 
assignments have been made by using com- 
putational tools, although the statistical 
models in Panther, Pfam, and SMART have 
been built, annotated, and reviewed by expert
biologists. In the set of computationally
predicted genes, we expect both false-positive
predictions (some of these may in fact be inactive
pseudogenes) and false-negative predictions
(some human genes will not be computationally
predicted). We also expect errors in
delimiting the boundaries of exons and genes. 
Similarly, in the automatic functional assign- 
ments, we also expect both false-positive and 
false-negative predictions. The functional as- 
signment protocol focuses on protein families 
that tend to be found across several organisms, 
or on families of known human genes. There- 
fore, we do not assign a function to many genes 
that are not in large families, even if the func- 
tion is known. Unless otherwise specified, all 
enumeration of the genes in any given family or 
functional category was taken from the set of
26,588 predicted proteins, which were assigned 
functions by using statistical score cutoffs de- 
fined for models in Panther, Pfam, and 
SMART. 

For this initial examination of the pre- 
dicted human protein set, three broad ques- 
tions were asked: (i) What are the likely 
molecular functions of the predicted gene
products, and how are these proteins categorized
with current classification methods?
(ii) What are the core functions that
appear to be common across the animals? 



(iii) How does the human protein comple- 
ment differ from that of other sequenced 
eukaryotes? 

7.1 Molecular functions of predicted 
human proteins 

Figure 15 shows an overview of the puta- 
tive molecular functions of the predicted 
26,588 human proteins that have at least
two lines of supporting evidence. About
41% (12,809) of the gene products could
not be classified from this initial analysis
and are termed proteins with unknown
functions. Because our automatic classifi- 
cation methods treat only relatively large 
protein families, there are a number of
"unclassified" sequences that do, in fact,
have a known or predicted function. For the 
60% of the protein set that have automatic 
functional predictions, the specific protein 
functions have been placed into broad 
classes. We focus here on molecular func- 
tion (rather than higher order cellular pro- 
cesses) in order to classify as many proteins 
as possible. These functional predictions 
are based on similarity to sequences of
known function. 

In our analysis of the 12,731 additional low- 
confidence predicted genes (those with only 
one piece of supporting evidence), only 636 
(5%) of these additional putative genes were
assigned molecular functions by the automated
methods. One-third of these 636 predicted
genes represented endogenous retroviral proteins,
further suggesting that the majority of
these unknown-function genes are not real
genes. Given that most of these additional
12,095 genes appear to be unique among the
genomes sequenced to date, many may simply
represent false-positive gene predictions.

The most common molecular functions are 
the transcription factors and those involved in 
nucleic acid metabolism (nucleic acid enzyme). 

Other functions that are highly represented in
the human genome are the receptors, kinases,
and hydrolases. Not surprisingly, most of the
hydrolases are proteases. There are also many
proteins that are members of proto-oncogene
families, as well as families of "select regulatory
molecules": (i) proteins involved in specific
steps of signal transduction such as heterotrimeric
GTP-binding proteins (G proteins) and
cell cycle regulators, and (ii) proteins that mod-
ulate the activity of kinases, G proteins, and 
phosphatases. 

Table 17. Distribution of SNPs in classes of
genomic regions. 
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Fig. 15. Distribution 
of the molecular
functions of 26,383
human genes. Each
slice lists the numbers
and percentages
(in parentheses) of
human gene functions
assigned to a given
category of molecular
function. The outer circle
shows the assignment
to molecular
function categories in
the Gene Ontology
(GO) (179), and the
inner circle shows
the assignment to
Celera's Panther molecular
function categories
(116).
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7.2 Evolutionary conservation of core 
processes 

Because of the various "model organism"
genome-sequencing projects that have already
been completed, reasonable comparative
information is available for beginning the
analysis of the evolution of the human genome.
The genomes of S. cerevisiae ("bakers'
yeast") (118) and two diverse invertebrates,
C. elegans (a nematode worm) (119)
and D. melanogaster (fly) (26), as well as the
first plant genome, A. thaliana, recently completed
(92), provide a diverse background for
genome comparisons. 

We enumerated the "strict orthologs" con- 
served between human and fly, and between 
human and worm (Fig. 16) to address the 
question, What are the core functions that 
appear to be common across the animals?
The concept of orthology is important be- 
cause if two genes are orthologs, they can be 
traced by descent to the common ancestor of 
the two organisms (an "evolutionarily con- 
served protein set"), and therefore are likely 
to perform similar conserved functions in the 
different organisms. It is critical in this analysis
to separate orthologs (a gene that appears
in two organisms by descent from a common
ancestor) from paralogs (a gene that appears
in more than one copy in a given organism by
a duplication event) because paralogs may
subsequently diverge in function. Following
the yeast-worm ortholog comparison in 



(120), we identified two different cases for
each pairwise comparison (human-fly and
human-worm). The first case was a pair of
genes, one from each organism, for which
there was no other close homolog in either
organism. These are straightforwardly identified
as orthologous, because there are no
additional members of the families that complicate
separating orthologs from paralogs.
The second case is a family of genes with
more than one member in either or both of the
organisms being compared. Chervitz et al.
(120) dealt with this case by analyzing a
phylogenetic tree that described the relationships
between all of the sequences in both
organisms, and then looking for pairs of genes
that were nearest neighbors in the tree. If the
nearest-neighbor pairs were from different
organisms, those genes were presumed to be
orthologs. We note that these nearest neighbors
can often be confidently identified from
pairwise sequence comparison without having
to examine a phylogenetic tree (see legend
to Fig. 16). If the nearest neighbors are
not from different organisms, there has been
a paralogous expansion in one or both organisms
after the speciation event (and/or a gene
loss by one organism). When this one-to-one
correspondence is lost, defining an ortholog
becomes ambiguous. For our initial computational
overview of the predicted human protein
set, we could not answer this question for
every predicted protein. Therefore, we consider
only "strict orthologs," i.e., the proteins
with unambiguous one-to-one relationships
(Fig. 16). By these criteria, there are 2758
strict human-fly orthologs and 2031 human-worm
orthologs (1523 in common between these sets).
We define the evolutionarily conserved set as
those 1523 human proteins that have strict
orthologs in both D. melanogaster and C.
elegans.
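
The strict-ortholog criterion can be sketched as a bidirectional best-hit search with an extra check against within-species paralogs. This is a hedged illustration, not the code used for the analysis: it assumes the BLASTP hits have already been filtered by the P-value cutoff and are supplied as nested dictionaries of scores, and every name and score below is hypothetical.

```python
def best_hit(query, hits):
    """Highest-scoring subject for `query`, or None if it has no hits."""
    candidates = hits.get(query, {})
    return max(candidates, key=candidates.get) if candidates else None


def strict_orthologs(a_vs_b, b_vs_a, a_paralogs, b_paralogs):
    """Pairs that are mutual best hits and outscore every paralog in either organism."""
    pairs = set()
    for a in a_vs_b:
        b = best_hit(a, a_vs_b)
        if b is None or best_hit(b, b_vs_a) != a:
            continue  # not a bidirectional best hit
        score = a_vs_b[a][b]
        rival_a = max(a_paralogs.get(a, {}).values(), default=float("-inf"))
        rival_b = max(b_paralogs.get(b, {}).values(), default=float("-inf"))
        if score > rival_a and score > rival_b:
            pairs.add((a, b))
    return pairs


# Hypothetical scores: H = human, F = fly, W = worm.
human_vs_fly = {"H1": {"F1": 900, "F2": 300}, "H2": {"F2": 500}, "H3": {"F3": 400}}
fly_vs_human = {"F1": {"H1": 900}, "F2": {"H2": 500, "H1": 300}, "F3": {"H3": 400}}
human_vs_worm = {"H1": {"W1": 700}, "H3": {"W3": 600}}
worm_vs_human = {"W1": {"H1": 700}, "W3": {"H3": 600}}
human_paralogs = {"H3": {"H4": 450}}  # H3 has a human paralog scoring 450

human_fly = strict_orthologs(human_vs_fly, fly_vs_human, human_paralogs, {})
human_worm = strict_orthologs(human_vs_worm, worm_vs_human, human_paralogs, {})

# Conserved set: human proteins with a strict ortholog in both other organisms.
conserved = {h for h, _ in human_fly} & {h for h, _ in human_worm}
print(sorted(human_fly), sorted(human_worm), sorted(conserved))
```

In the toy data, H3 is excluded from the human-fly set because its human paralog scores higher than its fly match, yet it still qualifies in the human-worm comparison; only proteins that pass in both comparisons enter the conserved set.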

The distribution of the functions of the conserved protein set is shown in
Fig. 16. Comparison with Fig. 15 shows that, not surprisingly, the set of
conserved proteins is not distributed among molecular functions in the same
way as the whole human protein set. Compared with the whole human set
(Fig. 15), there are several categories that are overrepresented in the
conserved set by a factor of ~2 or more. The first category is nucleic acid
enzymes, primarily the transcriptional machinery (notably DNA/RNA
methyltransferases, DNA/RNA polymerases, helicases, DNA ligases, DNA- and
RNA-processing factors, nucleases, and ribosomal proteins). The basic
transcriptional and translational machinery is well known to have been
conserved over evolution, from bacteria through to the most complex
eukaryotes. Many ribonucleoproteins involved in RNA splicing also appear to
be conserved among the animals. Other enzyme types are also overrepresented
(transferases, oxidoreductases, ligases,
lyases, and isomerases). Many of these enzymes are involved in intermediary
metabolism. The only exception is the hydrolase category, which is not
significantly overrepresented in the shared protein set. Proteases form the
largest part of this category, and several large protease families have
expanded in each of these three organisms after their divergence. The
category of select regulatory molecules is also overrepresented in the
conserved set. The major conserved families are small guanosine
triphosphatases (GTPases) (especially the Ras-related superfamily,
including ADP ribosylation factor) and cell cycle regulators (particularly
the cullin family, cyclin C family, and several cell division protein
kinases). The last two significantly overrepresented categories are protein
transport and trafficking, and chaperones. The most conserved groups in
these categories are proteins involved in coated vesicle-mediated
transport, and chaperones involved in protein folding and heat-shock
response [particularly the DNAJ family, and heat-shock protein 60 (HSP60),
HSP70, and HSP90 families]. These observations provide only a conservative
estimate of the protein families in the context of specific cellular
processes that were likely derived from the last common ancestor of the
human, fly, and worm. As stated before, this analysis does not provide a
complete estimate of conservation across the three animal genomes, as
paralogous duplication makes the determination of true orthologs difficult
within the members of conserved protein families.

Fig. 16. Functions of putative orthologs across vertebrate and invertebrate
genomes. Each slice lists the number and percentages (in parentheses) of
"strict orthologs" between the human, fly, and worm genomes involved in a
given category of molecular function. "Strict orthologs" are defined here
as bi-directional BLAST best hits [reference number illegible] such that
each orthologous pair (i) has a BLASTP P-value of ≤10^-10 (120), and (ii)
has a more significant BLASTP score than any paralogs in either organism,
i.e., there has likely been no duplication subsequent to speciation that
might make the orthology ambiguous. This measure is quite strict and is a
lower bound on the number of orthologs. By these criteria, there are 2758
strict human-fly orthologs, and 2031 human-worm orthologs (1523 in common
between these sets).

[Fig. 16 slice labels, as far as legible: count, percentage]
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
nucleic acid enzyme (221, 12.9%)
receptor (23, 1.3%)
kinase (69, 4.0%)
select regulatory molecule (88, 5.1%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
protooncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
transferase (70, 4.1%)
synthase and synthetase (64, 3.7%)
oxidoreductase (64, 3.7%)
lyase (12, 0.7%)
ligase (9, 0.5%)
molecular function unknown (613, 35.8%)
hydrolase (80, 4.7%)
isomerase (21, 1.2%)

7.3 Differences between the human genome and other sequenced
eukaryotic genomes

To explore the molecular building blocks of the vertebrate taxon, we have
compared the human genome with the other sequenced eukaryotic genomes at
three levels: molecular functions, protein families, and protein domains.

Molecular differences can be correlated with phenotypic differences to
begin to reveal the developmental and cellular processes that are unique to
the vertebrates. Tables 18 and 19 display a comparison among all sequenced
eukaryotic genomes, over selected protein/domain families (defined by
sequence similarity, e.g., the serine-threonine protein kinases) and
superfamilies (defined by shared molecular function, which may include
several sequence-related families, e.g., the cytokines). In these tables we
have focused on (super) families that are either very large or that differ
significantly in humans compared with the other sequenced eukaryote
genomes. We have found that the most prominent human expansions are in
proteins involved in (i) acquired immune functions; (ii) neural
development, structure, and functions; (iii) intercellular and
intracellular signaling pathways in development and homeostasis; (iv)
hemostasis; and (v) apoptosis.

Acquired immunity. One of the most striking differences between the human
genome and the Drosophila or C. elegans genome is the appearance of genes
involved in acquired immunity (Tables 18 and 19). This is expected, because
the acquired immune response is a defense system that only occurs in
vertebrates. We observe 22 class I and 22 class II major histocompatibility
complex (MHC) antigen genes and 114 other immunoglobulin genes in the human
genome. In addition, there are 59 genes in the cognate immunoglobulin
receptor family. At the domain level, this is exemplified by an expansion
and recruitment of the ancient immunoglobulin fold to constitute molecules
such as MHC, and of the integrin fold to form several of the cell adhesion
molecules that mediate interactions between immune effector cells and the
extracellular matrix. Vertebrate-specific proteins include the paracrine
immune regulators family of secreted 4-alpha helical bundle proteins,
namely the cytokines and chemokines. Some of the cytoplasmic signal
transduction components associated with cytokine receptor signal
transduction are also features that are poorly represented in the fly and
worm. These include protein domains found in the signal transducer and
activator of transcription (STATs), the suppressors of cytokine signaling
(SOCS), and protein inhibitors of activated STATs (PIAS). In contrast, many
of the animal-specific protein domains that play a role in innate immune
response, such as the Toll receptors, do not appear to be significantly
expanded in the human genome.

Neural development, structure, and function. In the human genome, as
compared with the worm and fly genomes, there is a marked increase in the
number of members of protein families that are involved in neural
development. Examples include neurotrophic factors such as ependymin, nerve
growth factor, and signaling molecules such as semaphorins, as well as the
number of proteins involved directly in neural structure and function such
as myelin proteins, voltage-gated ion channels, and synaptic proteins such
as synaptotagmin. These observations correlate well with the known
phenotypic differences between the nervous systems of these taxa, notably
(i) the increase in the number and connectivity of neurons; (ii) the
increase in number of distinct neural cell types (as many as a thousand or
more in human compared with a few hundred in fly and worm) (121); (iii) the
increased length of individual axons; and (iv) the significant increase in
glial cell number, especially the appearance of myelinating glial cells,
which are electrically inert supporting cells differentiated from the same
stem cells as neurons. A number of prominent protein expansions are
involved in the processes of neural development. Of the extracellular
domains that mediate cell adhesion, the connexin domain-containing proteins
(122) exist only in humans. These proteins, which are not present in the
Drosophila or C. elegans genomes, appear to provide the constitutive
subunits of intercellular channels and the structural basis for electrical
coupling. Pathway finding by axons and neuronal network formation is
mediated through a subset of ephrins and their cognate receptor tyrosine
kinases that act as positional labels to establish topographical
projections (123). The probable biological role for the semaphorins (22 in
human compared with 6 in the fly and 2 in the worm) and their receptors
(neuropilins and plexins) is that of axonal guidance molecules (124).
Signaling molecules such as neurotrophic factors and some cytokines have
been shown to regulate neuronal cell survival, proliferation, and axon
guidance (125). Notch receptors and ligands play important roles in glial
cell fate determination and gliogenesis (126).

Other human expanded gene families play key roles directly in neural
structure and function. One example is synaptotagmin (expanded more than
twofold in humans relative to the invertebrates), originally found to
regulate synaptic transmission by serving as a Ca2+ sensor (or receptor)
during synaptic vesicle fusion and release (127). Of interest is the
increased co-occurrence in humans of PDZ and the SH3 domains in
neuronal-specific adaptor molecules; examples include proteins that likely
modulate channel activity at synaptic junctions (128). We also noted
expansions in several ion-channel families (Table 19), including the EAG
subfamily (related to cyclic nucleotide gated channels), the voltage-gated
calcium/sodium channel family, the inward-rectifier potassium channel
family, and the voltage-gated potassium channel, alpha subunit family.
Voltage-gated sodium and potassium channels are involved in the generation
of action potentials in neurons. Together with voltage-gated calcium
channels, they also play a key role in coupling action potentials to
neurotransmitter release, in the development of neurites, and in short-term
memory. The recent observation of a calcium-regulated association between
sodium channels and synaptotagmin may have consequences for the
establishment and regulation of neuronal excitability (129).

Myelin basic protein and myelin-associated glycoprotein are major classes
of protein components in both the central and peripheral nervous system of
vertebrates. Myelin P0 is a major component of peripheral myelin, and
myelin proteolipid and myelin oligodendrocyte glycoprotein are found in the
central nervous system. Mutations in any of these
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Table 18. Domain-based comparative analysis of proteins in H. sapiens (H). 

D. melanogaster (F), C. elegans (W), S. cerevisiae (Y), and A. thaliana (A).
The predicted protein set of each of the above eukaryotic organisms was
analyzed with Pfam version 5.5 using E value cutoffs of 0.001. The number of
proteins containing the specified Pfam domains as well as the total number
of domains (in parentheses) are shown in each column. Domains were
categorized into cellular processes for presentation. Some domains (i.e.,
SH2) are listed in more than one cellular process. Results of the Pfam
analysis may differ from results obtained based on human curation of
protein families, owing to the limitations of large-scale automatic
classifications. Representative examples of domains with reduced counts
owing to the stringent E value cutoff used for this analysis are marked with
a double asterisk (**). Examples include short divergent and predominantly
alpha-helical domains, and certain classes of cysteine-rich zinc finger
proteins.
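
The two numbers reported per table cell (proteins containing a domain, and total domain occurrences in parentheses) can be tallied as in the following minimal sketch. The input layout `(protein_id, domain_name, evalue)` and the function names are hypothetical and do not reflect any Pfam tool's actual output format. Treating the 0.001 cutoff as inclusive, and printing the parenthesized total only when it differs from the protein count, are assumptions inferred from how the table cells appear.

```python
from collections import defaultdict


def domain_counts(hits, max_evalue=0.001):
    """Tally, per domain, (number of distinct proteins, total domain hits).

    `hits` is an iterable of (protein_id, domain_name, evalue) tuples, a
    hypothetical pre-parsed form of a domain search against one proteome.
    """
    proteins = defaultdict(set)
    totals = defaultdict(int)
    for protein_id, domain, evalue in hits:
        if evalue > max_evalue:
            continue  # hit fails the E-value cutoff
        proteins[domain].add(protein_id)
        totals[domain] += 1
    return {domain: (len(proteins[domain]), totals[domain]) for domain in totals}


def format_cell(n_proteins, n_domains):
    """Render a cell as 'N' or 'N (M)' when a protein carries repeated domains."""
    if n_proteins == n_domains:
        return str(n_proteins)
    return f"{n_proteins} ({n_domains})"
```

A stricter cutoff lowers both numbers, which is consistent with the caption's remark that some short or divergent domains show reduced counts.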



Accession
number    Domain name        Domain description

Developmental and homeostatic regulators
PF02039   Adrenomedullin     Adrenomedullin
PF00212   ANP                Atrial natriuretic peptide
PF00028   Cadherin           Cadherin domain
PF00214   Calc_CGRP_IAPP     Calcitonin/CGRP/IAPP family
PF01110   CNTF               Ciliary neurotrophic factor
PF01093   Clusterin          Clusterin
PF00029   Connexin           Connexin
PF00976   ACTH_domain        Corticotropin ACTH domain
PF00473   CRF                Corticotropin-releasing factor family
PF00007   Cys_knot           Cystine-knot domain
PF00778   DIX                Dix domain
PF00322   Endothelin         Endothelin family
PF00812   Ephrin             Ephrin
PF01404   EPh_lbd            Ephrin receptor ligand binding domain
PF00167   FGF                Fibroblast growth factor
PF01534   Frizzled           Frizzled/Smoothened family membrane region
PF00236   Hormone6           Glycoprotein hormones
PF01153   Glypican           Glypican
PF01271   Granin             Granin (chromogranin or secretogranin)
PF02058   Guanylin           Guanylin precursor
PF00049   Insulin            Insulin/IGF/Relaxin family
PF00219   IGFBP              Insulin-like growth factor binding proteins
PF02024   Leptin             Leptin
PF00193   Xlink              LINK (hyaluron binding)
PF00243   NGF                Nerve growth factor family
PF02158   Neuregulin         Neuregulin family
PF00184   Hormone5           Neurohypophysial hormones
PF02070   NMU                Neuromedin U
PF00066   Notch              Notch (DSL) domain
PF00865   Osteopontin        Osteopontin
PF00159   Hormone3           Pancreatic hormone peptides
PF01279   Parathyroid        Parathyroid hormone family
PF00123   Hormone2           Peptide hormone
PF00341   PDGF               Platelet-derived growth factor (PDGF)
PF01403   Sema               Sema domain
PF01033   Somatomedin_B      Somatomedin B domain
PF00103   Hormone            Somatotropin
PF02208   Sorb               Sorbin homologous domain
PF02404   SCF                Stem cell factor
PF01034   Syndecan           Syndecan domain
PF00020   TNFR_c6            TNFR/NGFR cysteine-rich region
PF00019   TGF-β              Transforming growth factor β-like domain
PF01099   Uteroglobin        Uteroglobin family
PF01160   Opioids_neuropep   Vertebrate endogenous opioids neuropeptide
PF00110   Wnt                Wnt family of developmental signaling proteins

Hemostasis
PF01821   ANATO              Anaphylotoxin-like domain
PF00386   C1q                C1q domain
PF00200   Disintegrin        Disintegrin
PF00754   F5_F8_type_C       F5/8 type C domain
PF01410   COLFI              Fibrillar collagen C-terminal domain
PF00039   Fn1                Fibronectin type I domain
PF00040   Fn2                Fibronectin type II domain
PF00051   Kringle            Kringle domain
PF01823   MACPF              MAC/Perforin domain
PF00354   Pentaxin           Pentaxin family
PF00277   SAA_proteins       Serum amyloid A protein
PF00084   Sushi              Sushi domain (SCR repeat)
PF02210   TSPN               Thrombospondin N-terminal-like domains
PF01108   Tissue_fac         Tissue factor
PF00868   Transglutamin_N    Transglutaminase family
PF00927   Transglutamin_C    Transglutaminase family

H

1 
2 

. 100(550) 
3 
1 
3 

- 14(16) 
1 
2 

10(11) 
5 
3 

7(8) 
12 
23 
9 
1 

14 
3 
1 
7 

10 
1 

13(23) 
3 
4 
1 

3(5) 
1 
3 
2 

5(9) 
5 

27(29) 
5(8) 
1 
2 
2 

17(31) 
27(28) 
3 
3 
18 



6(14) 

- 24, 
18" 

- 15(20) 
. 10 
5(18) 

: 11(16)- 

15(24) 
6 
9 
4 

53 {191) 
14 
1 
6 
8 



0 
0 

14(157) 
0 
0 

: 0 

* 

0 
0 

1 . 

2 

2 

0 

2 

2 

1 

7 

0 

2 

0 

0 

* 4 
0 
0 
0 
0 
0 
0 
0 

2(4) 
0 

• o 

0 

o 
1 

8(10) 
3 

, 0 

o 
i 
i 

6 
0 
0 

7(10) 

0 
0 
2 

5(6) 
0 , 
0 

b • 



o 
o 

11(42) 
1 
0 

1 
1 



0 
0 

16(66) 
0 
0 
0 
O - 
0 
0 
0 
4 
0 
4 
1 
1 
3 
0 
1 
0 
0 
0 
0 
0 

1 

0 ' 
0 
. 0 
0 

2(6) 
0 
0 
0 
0 
0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 



0 

0 

3 

2 

0 

0 

0 

2 

0 

0 • 
0 

8(45) 
0 
0 
0 
0 



0 
0 
0 
0 
0 
• 0- 

0 

0 

0 

0 

0 

0 

0 

0' 

0 

0 

0 

0 

0 

0 

0 

0 

o . 

0 

0 * 
. 0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 
0 
0 
0 
0 

o 

0 . 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 
0 
0 
0 
0 

o 

0/ 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

o 

0 
0 

0 
0 
0 

. 0 
. 0 

b 

0 
0 

o 
o 
o 
o 

0 

0 

0 

0 

0 

0 

0 

o 

0 
0 
0 
0 

o 



0 
0 
0 
0 

o 

0 
•0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
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Table 18 (Continued) 



Accession
number    Domain name        Domain description

Hemostasis (continued)
PF00594   gla                Vitamin K-dependent carboxylation/gamma-carboxyglutamic (GLA) domain

Immune response
PF00711   Defensin_beta      Beta defensin
PF00748   Calpain_inhib      Calpain inhibitor repeat
PF00666   Cathelicidins      Cathelicidins
PF00129   MHC_I              Class I histocompatibility antigen, domains alpha 1 and 2
PF00993   MHC_II_alpha**     Class II histocompatibility antigen, alpha domain
PF00969   MHC_II_beta**      Class II histocompatibility antigen, beta domain
PF00879   Defensin_propep    Defensin propeptide
PF01109   GM_CSF             Granulocyte-macrophage colony-stimulating factor
PF00047   ig                 Immunoglobulin domain
PF00143   Interferon         Interferon alpha/beta domain
PF00714   IFN-gamma          Interferon gamma
PF00726   IL10               Interleukin-10
PF02372   IL15               Interleukin-15
PF00715   IL2                Interleukin-2
PF00727   IL4                Interleukin-4
PF02025   IL5                Interleukin-5
PF01415   IL7                Interleukin-7/9 family
PF00340   IL1                Interleukin-1
PF02394   IL1_propep         Interleukin-1 propeptide
PF02059   IL3                Interleukin-3
PF00489   IL6                Interleukin-6/G-CSF/MGF family
PF01291   LIF_OSM            Leukemia inhibitory factor (LIF)/oncostatin (OSM) family
PF00323   Defensins          Mammalian defensin
PF01091   PTN_MK             PTN/MK heparin-binding protein
PF00277   SAA_proteins       Serum amyloid A protein
PF00048   IL8                Small cytokines (intecrine/chemokine), interleukin-8 like
PF01582   TIR                TIR domain
PF00229   TNF                TNF (tumor necrosis factor) family
PF00088   Trefoil            Trefoil (P-type) domain

PI-PY-rho GTPase signaling
PF00779   BTK                BTK motif
PF00168   C2                 C2 domain
PF00609   DAGKa              Diacylglycerol kinase accessory domain (presumed)
PF00781   DAGKc              Diacylglycerol kinase catalytic domain (presumed)
PF00610   DEP                Domain found in Dishevelled, Egl-10, and Pleckstrin (DEP)
PF01363   FYVE               FYVE zinc finger
PF00996   GDI                GDP dissociation inhibitor
PF00503   G-alpha            G-protein alpha subunit
PF00631   G-gamma            G-protein gamma like domains
PF00616   RasGAP             GTPase-activator protein for Ras-like GTPase
PF00618   RasGEFN            Guanine nucleotide exchange factor for Ras-like GTPases; N-terminal motif
PF00625   Guanylate_kin      Guanylate kinase
PF02189   ITAM               Immunoreceptor tyrosine-based activation motif
PF00169   PH                 PH domain
PF00130   DAG_PE-bind        Phorbol esters/diacylglycerol binding domain (C1 domain)
PF00388   PI-PLC-X           Phosphatidylinositol-specific phospholipase C, X domain
PF00387   PI-PLC-Y           Phosphatidylinositol-specific phospholipase C, Y domain
PF00640   PID                Phosphotyrosine interaction domain (PTB/PID)
PF02192   PI3K_p85B          PI3-kinase family, p85-binding domain
PF00794   PI3K_rbd           PI3-kinase family, ras-binding domain
PF01412   ArfGAP             Putative GTP-ase activating protein for Arf
PF02196   RBD                Raf-like Ras-binding domain
PF02145   Rap_GAP            Rap/ran-GAP
PF00788   RA                 Ras association (RalGDS/AF-6) domain
PF00071   Ras                Ras family
PF00617   RasGEF             RasGEF domain
PF00615   RGS                Regulator of G protein signaling domain
PF02197   RIIa               Regulatory subunit of type II PKA R-subunit

H



11 



..3(9) 
2 

V. - '1?( 20 ) 

: 5 (6) 
7 
3 
1 

381 (930J 
7(9) 
1 
1 
1 
1 
1 
1 
1 
7 

1 «r 
1 

2 
2 

2 
2 
4 
32 

18 

12 * 
5(6) 

5 

73 (101)' 
9 
10 
12(13) 

28 (30) 
6 

27(30) 
16 
11 
9 

.12 
3 

193 (212) 



0 
0 
0 
0 

O 

o 

0 

o 

125 (291) 
~0 
0 
0 
0 
0 
O 
0 
0 
O 
O 
0 
O 
O 

0 
0 
0 
0 

8 
0 
0 



0 
0 

■o 

* • * ■ 
* 

o 

0 
0 
0 

67 (323) 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 

2 
0 

2-' 



O 

. 0 
' 0 

. b 

* 

0 

o 

0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 

b 

0 
0 



0 
0 

. 0 

o . 

0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 



131 (143) 
0 
0 



1 


. .0 


0 


0 


32(44) 


24 (35) 


6(9) 


66(90) 


4 


7 


0 


6 


8 


8 


2 


11(12) 


4 


10 


5 


2 


: 14 


15 


5 


-■'i 15 


2 


1 


1 


3 


10 


20 (23) 


2 


5 


5 . 


5 


1 


0 


5 


8 


3 


0 


2 


3 


5 


0 


8 


7 


1 


4 


0 


0 


0 


0 



45(56) 


25(31) 


26(40) 


1(2) * 


..' 4 


12 


3 


7 


1 


fi 


11 

* 

* 


2 


7 


1 


8 


24(27) 


13 


11(12) 


0 


0 


" 2 


1 


1 


0 


0 


6 


3 


1 


0 


0 


16 


9 


8 


*6 


15 


6(7) 


4 


1 


0 


0 


5 


4 . 


2 


0 


0 


18(19) 


7(9) 


6 


1 


0 


126 


56(57) 


51 


23 


78 


2.1 


8 




5 


0 


27 


6(7) 


' 12(13) 


1 


0 


4 


1 


2 


1 


0 
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Accession 
number    Domain name        Domain description

PI-PY-rho GTPase signaling (continued)
PF00620   RhoGAP             RhoGAP domain
PF00621   RhoGEF             RhoGEF domain
PF00536   SAM                SAM domain (Sterile alpha motif)
PF01369   Sec7               Sec7 domain
PF00017   SH2                Src homology 2 (SH2) domain
PF00018   SH3                Src homology 3 (SH3) domain
PF01017   STAT               STAT protein
PF00790   VHS                VHS domain
PF00568   WH1                WH1 domain

Domains involved in apoptosis
PF00452   Bcl-2              Bcl-2
PF02180   BH4                Bcl-2 homology region 4
PF00619   CARD               Caspase recruitment domain
PF00531   Death              Death domain
PF01335   DED                Death effector domain
PF02179   BAG                Domain present in Hsp70 regulators
PF00656   ICE_p20            ICE-like protease (caspase) p20 domain
PF00653   BIR                Inhibitor of Apoptosis domain

Cytoskeletal
PF00022   Actin              Actin
PF00191   Annexin            Annexin
PF00402   Calponin           Calponin family
PF00373   Band_41            FERM domain (Band 4.1 family)
PF00880   Nebulin_repeat     Nebulin repeat
PF00681   Plectin_repeat     Plectin repeat
PF00435   Spectrin           Spectrin repeat
PF00418   Tubulin-binding    Tau and MAP proteins, tubulin-binding
PF00992   Troponin           Troponin
PF02209   VHP                Villin headpiece domain
PF01044   Vinculin           Vinculin family

ECM adhesion
PF01391   Collagen           Collagen triple helix repeat (20 copies)
PF01413   C4                 C-terminal tandem repeated domain in type 4 procollagen
PF00431   CUB                CUB domain
PF00008   EGF                EGF-like domain
PF00147   Fibrinogen_C       Fibrinogen beta and gamma chains, C-terminal globular domain
PF00041   Fn3                Fibronectin type III domain
PF00757   Furin-like         Furin-like cysteine rich region
PF00357   Integrin_A         Integrin alpha cytoplasmic region
PF00362   Integrin_B         Integrins, beta chain
PF00052   Laminin_B          Laminin B (Domain IV)
PF00053   Laminin_EGF        Laminin EGF-like (Domains III and V)
PF00054   Laminin_G          Laminin G domain
PF00055   Laminin_Nterm      Laminin N-terminal (Domain VI)
PF00059   Lectin_c           Lectin C-type domain
PF01463   LRRCT              Leucine rich repeat C-terminal domain
PF01462   LRRNT              Leucine rich repeat N-terminal domain
PF00057   Ldl_recept_a       Low-density lipoprotein receptor domain class A
PF00058   Ldl_recept_b       Low-density lipoprotein receptor repeat class B
PF00530   SRCR               Scavenger receptor cysteine-rich domain
PF00084   Sushi              Sushi domain (SCR repeat)
PF00090   tsp_1              Thrombospondin type 1 domain
PF00092   Vwa                von Willebrand factor type A domain
PF00093   Vwc                von Willebrand factor type C domain
PF00094   Vwd                von Willebrand factor type D domain

Protein interaction domains
PF00244   14-3-3             14-3-3 proteins
PF00023   Ank                Ank repeat
PF00514   Armadillo_seg      Armadillo/beta-catenin-like repeats
PF00168   C2                 C2 domain
PF00027   cNMP_binding       Cyclic nucleotide-binding domain
PF01556   DnaJ_C             DnaJ C terminal region
PF00226   DnaJ               DnaJ domain
PF00036   Efhand**           EF hand
PF00611   FCH                Fes/CIP4 homology domain
PF01846   FF                 FF domain
PF00498   FHA                FHA domain

H



59 
46 
29(31) 
13 

. 87(95) 
143 (182) 
7 
4 
7 

v 9 
• . ' - 3 
• 16 
16 
4(5) 
5(8) 
11 
8(14) 

61^64) 
16(55) 
13(22) 
29 (30) 
4(148) 

2(11) 
31(195) 
4(12) 
4 
5 
4 

65(279)- 
■6(11) 



-47(69) 
108(420) 
26 

106 (545) 
5 
3 
8 

8(12) 
24 (126) 
30(57) 
10 
47 (76) 
69 (81) 

40 (44) 
35(127) 

15(96) 
.11(46) 
53(191) 

41 (66) 
34(58)* 
19(28) 
15(35) 



20 

145 (404) 
22 (56) 
73 (101) 
26(31) 

. 12 
44 

83 (151) 
9 

4(11) 
13 



19 
23(24) 
15 
5 

33 (39) 
55(75) 
1 
2 
2 

2 
0 
0 
5 
0 
3 

5(9) 

15(16) 
4(16) 
3 

17(19) 

1(2) 
0 

13(171) 
1(4) 
6 
2 
2 



/ 10(46) 
2(4) 

9 (47) 
45(186) 
10(11) 

42 (168) 
2 
1 
2 

' ' 4(7) 

18(42) 
6 

23 (24) 
23(30) 

703) 
33(152) 
9 (56) 
4(8) 
11 (42) 
.11(23) 
0 . 

6(11) 
.3(7) 

3 

72 (269) 
11(38) - 
32(44) 
21 (33) 
9 
34 

64(117) 
3 

4 (10) 
15 



20 
18(19) 
8 
5 

44(48) 
46 (61) 
1(2) 
4 

2(3) 

1 
1 

-2 . 
7 
0 
2 
3 

2(3) 

12 
4(11) 
7(19) 
11(14) 
1 
O 

10(93) 
2(8) 
8 
2 

1. 



.174(384) 
3(6) 

43 (67) 
54(157) 
6 

34(156) 
1 
2 
2 

6(10) 
11(65) 
14(26) 
4 

91 (132) 
7(9) 
3(6) 

27(113) 
7(22) 

1(2) 
8(45) 
18(47) 
17(19) 
2(5) 
• 9 

3 

75 (223) 

3(11) 
24(35) 
15(20) 
5 
33 
41(86) 
2 

3(16) 
7 



9 
3 
3 
5 
1 

23 (27) 
0 
4 
1 

O 
. 0 
0 
0 
0 

1 

0 

1(2) 

9fr11) 
0 
0 
0 
0 
0 
0 
0 
0 

o 

0 

. 0 
0 

0 
0 
0 

0 

o 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

d 

0 
0 
0 
0 
0 



2 

12(20) 
2(10) 
6(9) 
2(3) 
3 
20 

4(11) 
4 

2(5) 
13(14) 



8 
0 
• 6 
9 
3 
4 
0 
8 
0 

O 
.0 
0 
0 
0 
5 
0 
0 



24 
6(16) 
0 
0 
0 
0 

o 

0 
0 
5 

0. 

0 . 
0 

0 

1 

0 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

6 
o 

0 
0 

1 

0 
0 



15 

66 (111) 
25(67) 
\ 66 (90) 
22 
19 
93 

120 (328) 
0 

-.; 4(8) 
17 
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myelin proteins result in severe demyelina- 
tion, which is a pathological condition in 
which the myelin is lost and the nerve conduction is severely impaired
(130). Humans have at least 10 genes belonging to four different families
involved in myelin production (five myelin P0, three myelin proteolipid,
myelin basic protein, and myelin-oligodendrocyte glycoprotein, or MOG), and
possibly more remotely related members of the MOG family. Flies have only a
single myelin proteolipid, and worms have none at all.

Intercellular and intracellular signaling pathways in development and
homeostasis. Many protein families that have expanded in humans relative to
the invertebrates are involved in signaling processes, particularly in
response to development and differentiation

Table 18 (Continued)



Accession
number    Domain name        Domain description

Protein interaction domains (continued)
PF00254   FKBP               FKBP-type peptidyl-prolyl cis-trans isomerases
PF01590   GAF                GAF domain
PF01344   Kelch              Kelch motif
PF00560   LRR**              Leucine Rich Repeat
PF00917   MATH               MATH domain
PF00989   PAS                PAS domain
PF00595   PDZ                PDZ domain (also known as DHR or GLGF)
PF00169   PH                 PH domain
PF01535   PPR**              PPR repeat
PF00536   SAM                SAM domain (Sterile alpha motif)
PF01369   Sec7               Sec7 domain
PF00017   SH2                Src homology 2 (SH2) domain
PF00018   SH3                Src homology 3 (SH3) domain
PF01740   STAS               STAS domain
PF00515   TPR**              TPR domain
PF00400   WD40**             WD40 domain
PF00397   WW                 WW domain
PF00569   ZZ                 ZZ-Zinc finger present in dystrophin, CBP/p300

Nuclear interaction domains
PF01754   Zf-A20             A20-like zinc finger
PF01388   ARID               ARID DNA binding domain
PF01426   BAH                BAH domain
PF00643   Zf-B_box**         B-box zinc finger
PF00533   BRCT               BRCA1 C Terminus (BRCT) domain
PF00439   Bromodomain        Bromodomain
PF00651   BTB                BTB/POZ domain
PF00145   DNA_methylase      C-5 cytosine-specific DNA methylase
PF00385   Chromo             'chromo' (CHRromatin Organization Modifier) domain
PF00125   Histone            Core histone H2A/H2B/H3/H4
PF00134   Cyclin             Cyclin
PF00270   DEAD               DEAD/DEAH box helicase
PF01529   Zf-DHHC            DHHC zinc finger domain
PF00646   F-box**            F-box domain
PF00250   Fork_head          Fork head domain
PF00320   GATA               GATA zinc finger
PF01585   G-patch            G-patch domain
PF00010   HLH**              Helix-loop-helix DNA-binding domain
PF00850   Hist_deacetyl      Histone deacetylase family
PF00046   Homeobox           Homeobox domain
PF01833   TIG                IPT/TIG domain
PF02373   JmjC               JmjC domain
PF02375   JmjN               jmjN domain
PF00013   KH-domain          KH domain
PF01352   KRAB               KRAB box
PF00104   Hormone_rec        Ligand-binding domain of nuclear hormone receptor
PF00412   LIM                LIM domain containing proteins
PF00917   MATH               MATH domain
PF00249   Myb_DNA-binding    Myb-like DNA-binding domain
PF02344   Myc-LZ             Myc leucine zipper domain
PF01753   Zf-MYND            MYND finger
PF00628   PHD                PHD-finger
PF00157   Pou                Pou domain - N-terminal to homeobox domain
PF02257   RFX_DNA_binding    RFX DNA-binding domain
PF00076   Rrm                RNA recognition motif (a.k.a. RRM, RBD, or RNP domain)
PF02037   SAP                SAP domain
PF00622   SPRY               SPRY domain
PF01852   START              START domain
PF00907   T-box              T-box



15(20) 
7(8) 
54 (157) 
25(30) 
11 

18(19) 
96(154) 
193 (212) 
5 

29 (31) 
13 

87(95) 
143 (182) 
5 

72 (131) 
136 (305) 
32 (53) 
10(11) 

2(8) 
11 
8(10) 
32 (35) 
17(28) 
37(48) 
97 (98) 
3(4) 
24(27) 

75(81)-, 

19 
63 (66) 
15 
16 
35 (36) 
11(17) 
18 
60 (61) 
12 

160(178) 
29 (53) 
10 
7 

28(67) 
204(243) 
47 

62(129) 

•32(43) 
. 1 
14 
68(86) 
15 
7 

224 (324) 
15 

44(51). 

TO 
17(19) 



• ■ • 


* 


• • 

.T- 


; . ; ,., A • •• 


7(8) 


7(13) 


* 4 


24 (29) 


2(4) 


1 


0 


10 


12(48) 


13(41) 


3 


102 (178) 


24 (30) 


7(11) 


1 


15(16) 


5 


88 (161) 


1 


61 (74) 


9(10) 


6 


1 


13(18) 


60 (87) 


46(66) 


2 


5 


72(78) 


65 (68) 


24 


23 




u 


1 


474 (2485) 


15 


8 


3 


6 


5 


5 


5 


9 


33 (39) 


44 (48) 


1 


3 


55 (75) 


46 (61) 


23 (27) 


4 


1 


6 


2 


13 


^ 39(101) 


28 (54) 


16(31) 


65(124) 


98(226) 


72(153) 


56(121) 


167(344) 


' 24 (39) 


16(24) 


5(8) 


11(15) 


13 


10 


2 


. 10 


2 


2 


0 


8 


6 


4 


2 


7 


7(8) 


♦ 4(5) 


5 


21 (25) 


1 


2 


0 


o 


10(18) 


23 (35) 


10(16) 


12(16) 


16(22) 


18(26) 


10(15) 


28 


62(64) 


86(91) 




30 (31) 


mt 

1 




0 


13(15) 


14(15) 


17(18) 


1 (2) 


12 


5 


71 (73) 


8 


48 


10 


10 


11 


35 


48(50) 


55 (57) 


50 (52) 


84 (87) 


20 


16 


7 


22 


' 15 


309 (324) 


9 


165 (167) 


20(21) 


15 


4 


0 


5(e) 


8(10) 


9 


26 * 


16 


mm ~% 

13 


4 


mw m f m\ t m '\ 

14(15) 


44 


24 


4 


39 


5(6) 


8(10) 


5 


10 


100(103) * 


* * 82 (84) 


6 . 


66 


11(13) 


5(7) 


2 


1 


4 


6 


4 


7 


4 


2 


3 


7 


14(32) 


17(46) 


4(14) 


27(61) 


0 


0 


0 


0 


17 


142(147) 


o s 


0 


33(83) . 


33(79) 


4(7) 


10(16) 


.. 5 


88 (161) 


1 


61 (74) 


18(24) 


17(24) 


15(20) 


243 (401) 


- 0 


0 


0 


0 


14 - 


9 


1 


7 


40(53) 


32(44) 


14(15) 


96 (105) 


5 


4 


0 


0 


2 


1 


1 


0 


127(199) 


,94(145) 


43 (73) 


232 (369) 
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5 


5 


.6(7) 


10(12) 


5(7) 
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6 


2 
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0 


23 


8 


22 
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Table 18 (Continued) 



Accession
number    Domain name        Domain description

Nuclear interaction domains (continued)
PF02135   Zf-TAZ             TAZ finger
PF01285   TEA                TEA domain
PF02176   Zf-TRAF            TRAF-type zinc finger
PF00352   TBP                Transcription factor TFIID (or TATA-binding protein, TBP)
PF00567   TUDOR              TUDOR domain
PF00642   Zf-CCCH            Zinc finger, C-x8-C-x5-C-x3-H type (and similar)
PF00096   Zf-C2H2**          Zinc finger, C2H2 type
PF00097   Zf-C3HC4           Zinc finger, C3HC4 type (RING finger)
PF00098   Zf-CCHC            Zinc knuckle



H 


F 


2(iT 




4 


1 


6(9} 


1(3) 


2(4) 


4(8) 


9 (?4) 


9 (19) " 


17(22) 


6(8) 


564(4500) 


234 (771) 


135 (137) 


57 


9(17) 


6(10) 



w 

6(7) 
1 

2(4) 

4(5) 
22 (42) 
68(155) 
88 (89) 
1 7 (33) 



0 
1 
0 

1(2) 
0 

3(5) 
34(56) 
18 
7(13) 



10(15 



2« 



31 (46) 
21 (24) 
298 (304) 
68 (91) 



(Tables 18 and 19). They include secreted hormones and growth factors,
receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome
include growth factors such as wnt, transforming growth factor-β
(TGF-β), fibroblast growth factor (FGF), nerve growth factor, platelet
derived growth factor (PDGF), and ephrins. These growth factors affect
tissue differentiation and a wide range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The corresponding receptors
of these developmental ligands are also expanded in humans. For example,
our analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in
the worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the
wnt signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in
the worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The
Groucho family of transcriptional corepressors downstream in the wnt
pathway are even more markedly expanded, with 13 predicted members in
humans (2 in the fly, 1 in the worm).

Extracellular adhesion molecules involved in signaling are expanded in
the human genome (Tables 18 and 19). The interactions of several of
these adhesion domains with extracellular matrix proteoglycans play a
critical role in host defense, morphogenesis, and tissue repair (131).
Consistent with the well-defined role of heparan sulfate proteoglycans
in modulating these interactions (132), we observe an expansion of the
heparin sulfate sulfotransferases in the human genome relative to worm
and fly. These sulfotransferases modulate tissue differentiation (133).
A similar expansion in humans is noted in structural proteins that
constitute the actin-cytoskeletal architecture. Compared with the fly
and worm, we observe an explosive expansion of the nebulin (35 domains
per protein on average), aggrecan (12 domains per protein on average),
and plectin (5 domains per protein on average) repeats in humans. These
repeats are present in proteins involved in modulating the
actin-cytoskeleton with predominant expression in neuronal, muscle, and
vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed
several expanded protein families and domains involved in cytoplasmic
signal transduction (Table 18). In particular, signal transduction
pathways playing roles in developmental regulation and acquired immunity
were substantially enriched. There is a factor of 2 or greater expansion
in humans in the Ras superfamily GTPases and the GTPase activator and
GTP exchange factors associated with them. Although there are about the
same number of tyrosine kinases in the human and C. elegans genomes, in
humans there is an increase in the SH2, PTB, and ITAM domains involved
in phosphotyrosine signal transduction. Further, there is a twofold
expansion of phosphodiesterases in the human genome, compared with
either the worm or fly genomes.

The downstream effectors of the intracellular signaling molecules
include the transcription factors that transduce developmental fates.
Significant expansions are noted in the ligand-binding nuclear hormone
receptor class of transcription factors compared with the fly genome,
although not to the extent observed in the worm (Tables 18 and 19).
Perhaps the most striking expansion in humans is in the C2H2 zinc finger
transcription factors. Pfam detects a total of 4500 C2H2 zinc finger
domains in 564 human proteins, compared with 771 in 234 fly proteins.
This means that there has been a dramatic expansion not only in the
number of C2H2 transcription factors, but also in the number of these
DNA-binding motifs per transcription factor (8 on average in humans, 3.3
on average in the fly, and 2.3 on average in the worm). Furthermore,
many of these transcription factors contain either the KRAB or SCAN
domains, which are not found in the fly or worm genomes. These domains
are involved in the oligomerization of transcription factors and
increase the combinatorial partnering of these factors. In general, most
of the transcription factor domains are shared between the three animal
genomes, but the reassortment of these domains results in
organism-specific transcription factor families. The domain combinations
found in the human, fly, and worm include the BTB with C2H2 in the fly
and humans, and homeodomains alone or in combination with Pou and LIM
domains in all of the animal genomes. In plants, however, a different
set of transcription factors are expanded, namely, the myb family, and a
unique set that includes VP1 and AP2 domain-containing proteins (134).
The yeast genome has a paucity of transcription factors compared with
the multicellular eukaryotes, and its repertoire is limited to the
expansion of the yeast-specific C6 transcription factor family involved
in metabolic regulation.
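
The per-protein averages quoted above for the C2H2 zinc finger family follow directly from the protein and domain counts. The short Python sketch below is only an illustrative check, not part of the original analysis: the human and fly counts are the ones given in the text, while the worm counts (68 proteins, 155 domains) are assumed here to be the worm entry of the Zf-C2H2 row of Table 18, and the helper name `domains_per_protein` is hypothetical.

```python
# (proteins containing the domain, total domain count) per organism.
# Human and fly values are quoted in the text; the worm values are an
# assumption read from the Zf-C2H2 row of Table 18.
C2H2_COUNTS = {
    "human": (564, 4500),
    "fly": (234, 771),
    "worm": (68, 155),
}


def domains_per_protein(proteins: int, domains: int) -> float:
    """Average number of domains per domain-containing protein."""
    if proteins <= 0:
        raise ValueError("protein count must be positive")
    return domains / proteins


# Compute the average for each organism.
averages = {
    organism: domains_per_protein(proteins, domains)
    for organism, (proteins, domains) in C2H2_COUNTS.items()
}

for organism, average in averages.items():
    print(f"{organism}: {average:.1f} C2H2 domains per protein")
```

Rounded to one decimal place, these ratios agree with the averages stated in the text (about 8 in humans, 3.3 in the fly, and 2.3 in the worm).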
While we have illustrated expansions in a subset of signal transduction
molecules in the human genome compared with the other eukaryotic
genomes, it should be noted that most of the protein domains are highly
conserved. An interesting observation is that worms and humans have
approximately the same number of both tyrosine kinases and
serine/threonine kinases (Table 19). It is important to note, however,
that these are merely counts of the catalytic domain; the proteins that
contain these domains also display a wide repertoire of interaction
domains with significant combinatorial diversity.
Hemostasis. Hemostasis is regulated primarily by plasma proteases of the
coagulation pathway and by the interactions that occur between the
vascular endothelium and platelets. Consistent with known anatomical and
physiological differences between vertebrates and invertebrates,
extracellular adhesion domains that constitute proteins integral to
hemostasis are expanded in the human relative to the fly and worm
(Tables 18 and 19). We note the evolution of domains such as FIMAC, FN1,
FN2, and C1q that mediate surface interactions between hematopoietic
cells and the vascular matrix. In addition, there has been extensive
recruitment of more-ancient animal-specific domains such as VWA, VWC,
VWD, kringle, and FN3 into multidomain proteins that are involved in
hemostatic regulation. Although we do not find a large expansion in the
total number of serine proteases, this enzymatic domain has been
specifically recruited into several of these multidomain proteins for
proteolytic regulation in the vascular compartment. These are
represented in plasma proteins that belong to the kinin and complement
pathways. There is a significant expansion in two families of matrix
metalloproteases: ADAM (a disintegrin and metalloprotease) and MMPs
(matrix metalloproteases) (Table 19). Proteolysis of extracellular
matrix (ECM) proteins is critical for tissue development and for tissue
degradation in diseases such as cancer, arthritis, Alzheimer's disease,
and a variety of inflammatory conditions (135, 136). ADAMs are a family
of integral membrane proteins with a pivotal role in fibrinogenolysis
and modulating interactions between hematopoietic components and the
vascular matrix components. These proteins have been shown to cleave
matrix proteins, and even signaling molecules: ADAM-17 converts tumor
necrosis factor-α, and ADAM-10 has been implicated in the Notch
signaling pathway (135). We have identified 19 members of the matrix
metalloprotease family, and a total of 51 members of the ADAM and
ADAM-TS families.

Apoptosis. Evolutionary conservation of some of the apoptotic pathway
components across eukarya is consistent with its central role in
developmental regulation and as a response to pathogens and stress
signals. The signal transduction pathways involved in programmed cell
death, or apoptosis, are mediated by interactions between
well-characterized domains that include extracellular domains, adaptor
(protein-protein interaction) domains, and those found in effector and
regulatory enzymes (137). We enumerated the protein counts of central
adaptor and effector enzyme domains that are found only in the apoptotic
pathways to provide an estimate of divergence across eukarya and
relative expansion in the human genome when compared with the fly and
worm (Table 18). Adaptor domains found in proteins restricted only to
apoptotic regulation, such as the DED domains, are vertebrate-specific,
whereas others like BIR, CARD, and Bcl2 are represented in the fly and
worm (although the number of Bcl2 family members in humans is
significantly expanded). Although plants and yeast lack the caspases,
caspase-like molecules, namely the para- and meta-caspases, have been
reported in these organisms (138). Compared with other animal genomes,
the human genome shows an expansion in the adaptor and effector
domain-containing proteins involved in apoptosis, as well as in the
proteases involved in the cascade such as the caspase and calpain
families.

protein families. 
Metabolic enzymes. There are fewer cytochrome P450 genes in humans than
in either the fly or worm. Lipoxygenases (six in humans), on the other
hand, appear to be specific to the vertebrates and plants, whereas the
lipoxygenase-activating proteins (four in humans) may be
vertebrate-specific. Lipoxygenases are involved in arachidonic acid
metabolism, and they and their activators have been implicated in
diverse human pathology ranging from allergic responses to cancers. One
of the most surprising human expansions, however, is in the number of
glyceraldehyde-3-phosphate dehydrogenase (GAPDH) genes (46 in humans, 3
in the fly, and 4 in the worm). There is, however, evidence for many
retrotransposed GAPDH pseudogenes (139), which may account for this
apparent expansion. However, it is interesting that GAPDH, long known as
a conserved enzyme involved in basic metabolism found across all phyla
from bacteria to humans, has recently been shown
to have other functions. It has a second catalytic activity, as a uracil
DNA glycosylase (140), and functions as a cell cycle regulator (141) and
has even been implicated in apoptosis (142).

Table 19. Number of proteins assigned to selected Panther families or
subfamilies in H. sapiens (H), D. melanogaster (F), C. elegans (W),
S. cerevisiae (Y), and A. thaliana (A).

Panther family/subfamily*

Ependymin
Ion channels
Acetylcholine receptor
Amiloride-sensitive/degenerin
Neurotransmitter-gated
P2X purinoceptor
Transient receptor
Voltage-gated Ca2+ alpha
Voltage-gated Ca2+ alpha-2
Voltage-gated Ca2+ beta
Voltage-gated Ca2+ gamma
Voltage-gated K+ alpha
Voltage-gated KQT
Voltage-gated Na+
Myelin basic protein
Myelin P0
Myelin proteolipid
Myelin-oligodendrocyte glycoprotein
Neuropilin
Plexin
Semaphorin
Synaptotagmin
Cytokine†
Interleukin
Cytokine receptor†
Bradykinin/C-C chemokine receptor
Fl cytokine receptor
Interferon receptor
Interleukin receptor
Leukocyte tyrosine kinase
MCSF receptor
TNF receptor
Immunoglobulin receptor†
T-cell receptor alpha chain
T-cell receptor beta chain
T-cell receptor gamma chain
T-cell receptor delta chain
Immunoglobulin FC receptor
Killer cell receptor
Polymeric-immunoglobulin receptor

Translation. Another striking set of human expansions has occurred in
certain families involved in the translational machinery. We identified
28 different ribosomal subunits that each have at least 10 copies in the
genome; on average, for all ribosomal proteins there is about an 8- to
10-fold expansion in the number of genes relative to either the worm or
fly. Retrotransposed pseudogenes may account for many of these
expansions [see the discussion above and (143)]. Recent evidence
suggests that a number of ribosomal proteins have secondary functions
independent of their involvement in protein biosynthesis; for example,
L13a and the related L7 subunits (36 copies in humans) have been shown
to induce apoptosis (144).

There is also a four- to fivefold expansion in the elongation factor
1-alpha family (eEF1A; 56 human genes). Many of these expansions likely
represent intronless paralogs that have presumably arisen from
retrotransposition, and again there is evidence that many of these may
be pseudogenes (145). However, a second form (eEF1A2) of this factor has
been identified with tissue-specific expression in skeletal muscle and a
complementary expression pattern to the ubiquitously expressed eEF1A
(146).
Ribonucleoproteins. Alternative splicing results in multiple transcripts
from a single gene, and can therefore generate additional diversity in
an organism's protein complement. We have identified 269 genes for
ribonucleoproteins. This represents over 2 times the number of
ribonucleoprotein genes in the worm, two times that of the fly, and
about the same as the 265 identified in the Arabidopsis genome. Whether
the diversity of ribonucleoprotein genes in humans contributes to gene
regulation at either the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most
prominent expansion is the transglutaminases, calcium-dependent enzymes
that catalyze the cross-linking of proteins in cellular processes such
as hemostasis and apoptosis (147). The vitamin K-dependent gamma
carboxylase gene product acts on the GLA domain (missing in the fly and
worm) found in coagulation factors, osteocalcin, and matrix GLA protein
(148). Tyrosylprotein sulfotransferases participate in the
posttranslational modification of proteins involved in inflammation and
hemostasis, including coagulation factors and chemokine receptors (149).
Although there is no significant numerical increase in the counts for
domains involved in nuclear protein modification, there are a number of
domain arrangements in the predicted human proteins that are not found
in the other currently sequenced genomes. These include the tandem
association of two histone deacetylase domains in HD6 with a ubiquitin
finger domain, a feature lacking in the fly genome. An additional
example is the co-occurrence of the important nuclear regulatory enzyme
PARP (poly-ADP ribosyl transferase) domain fused to protein-interaction
domains (BRCT and VWA) in humans.

Concluding remarks. There are several possible explanations for the
differences in phenotypic complexity observed in humans when compared to
the fly and worm. Some of these relate to the prominent differences in
the immune system, hemostasis, neuronal, vascular, and cytoskeletal
complexity. The finding that the human genome contains fewer genes than
previously predicted might be compensated for by combinatorial diversity
generated at the levels of protein architecture, transcriptional and
translational control, posttranslational modification of proteins, or
posttranscriptional regulation. Extensive domain shuffling to increase
or alter combinatorial diversity can provide an exponential increase in
the ability to mediate protein-protein interactions without dramatically
increasing the absolute size of the protein complement (150). Evolution
of apparently new (from the perspective of sequence analysis) protein
domains and increasing regulatory complexity by domain accretion both
quantitatively and qualitatively (recruitment of novel domains with
preexisting ones) are two features that we observe in humans. Perhaps
the best illustration of this trend is the C2H2 zinc finger-containing
transcription factors, where we see expansion in the number of domains
per protein, together with vertebrate-specific domains such as KRAB and
SCAN. Recent reports on the prominent use of internal ribosomal entry
sites in the human genome to regulate translation of specific classes of
proteins suggest that this is an area that needs further research to
identify the full extent of this process in the human genome (151). At
the posttranslational level, although we provide examples of expansions
of some protein families involved in these modifications, further
experimental evidence is required to evaluate whether this is correlated
with increased complexity in protein processing. Posttranscriptional
processing and the extent of isoform generation in the human remain to
be cataloged in their entirety. Given the conserved nature of the
spliceosomal ma-

Table 19 (Continued)

Panther family/subfamily*

Claudin



C2H2 zinc finger-containing
COE 61
CREB
ETS-related
Groucho
Histone H1
Histone H2A
Histone H2B
Histone H3
Histone H4
Homeotic†
Iroquois class
Distal-less
Engrailed
LIM-containing
Paired box
Six
Leucine zipper
Nuclear hormone receptor†
Pou-related
Runt-related

8 Conclusions

8.1 The whole-genome sequencing approach versus BAC by BAC

Experience in applying the whole-genome shotgun sequencing approach to a
diverse group of organisms with a wide range of genome sizes and repeat
content allows us to assess its strengths and weaknesses. With the
success of the method for a large number of microbial genomes,
Drosophila, and now the human, there can be no doubt concerning the
utility of this method. The large number of microbial genomes that have
been sequenced by this method (15, 80, 152) demonstrate that
megabase-sized genomes can be sequenced efficiently without any input
other than the de novo mate-paired sequences. With more complex genomes
like those of Drosophila or human, map information, in the form of
well-ordered markers, has been critical for long-range ordering of
scaffolds. For joining scaffolds into chromosomes, the quality of the
map (in terms of the order of the markers) is more important than the
number of markers per se. Although this mapping could have been
performed concurrently with sequencing, [...] beneficial. During the
sequencing of the A. thaliana genome, sequencing of individual BAC
clones permitted extension of the sequence well into centromeric regions
and allowed high-quality resolution of complex repeat regions. Likewise,
in Drosophila, the BAC physical map was most useful in regions near the
highly repetitive centromeres and telomeres. WGA has been found to
deliver excellent-quality reconstructions of the unique regions of the
genome. As the genome size, and more importantly the repetitive content,
increases, the WGA approach delivers less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone approaches makes them
difficult to justify as a stand-alone strategy for future large-scale
genome-sequencing projects. Specific applications of BAC-based or other
clone mapping and sequencing strategies to resolve ambiguities in
sequence assembly that cannot be efficiently resolved with computational
approaches alone are clearly worth exploring. Hybrid approaches to
whole-genome sequencing will only work if there is sufficient coverage
in both the whole-genome shotgun phase and the BAC clone sequencing
phase. Our experience with human genome assembly suggests that this will
require at least 3× coverage of both whole-genome and BAC shotgun
sequence data.

8.2 The low gene number in humans

We have sequenced and assembled ~95% of the euchromatic sequence of
H. sapiens and used a new automated gene prediction method to produce a
preliminary catalog of the human genes. This has provided a major
surprise: We have found far fewer genes (26,000 to 38,000) than the
earlier molecular predictions (50,000 to over 140,000). Whatever the
reasons for this current disparity, only detailed annotation,
comparative genomics (particularly using the Mus musculus genome), and
careful molecular dissection of complex phenotypes will clarify this
critical issue of the basic "parts list" of our genome. Certainly, the
analysis is still incomplete and considerable refinement will occur in
the years to come as the precise structure of each transcription unit is
evaluated. A good place to start is to determine why the gene estimates
derived from EST data are so discordant with our predictions. It is
likely that the [...] the little-understood vagaries of RNA processing
that often leave intronic regions in an unspliced condition; the finding
that nearly [...] alternatively spliced (153); and finally, the unsolved
technical problems in EST library construction where contamination from
heterogeneous nuclear RNA and genomic DNA are not uncommon. Of course,
it is possible that there are genes that remain unpredicted owing to the
absence of EST or protein data to support them, although our use of
mouse genome data for predicting genes should limit this number. As was
true at the beginning of genome sequencing, ultimately it will be
necessary to measure mRNA in specific cell types to demonstrate the
presence of a gene.

J. B. S. Haldane speculated in 1937 that a population of organisms might
have to pay a price for the number of genes it can possibly carry. He
theorized that when the number of genes becomes too large, each zygote
carries so many new deleterious mutations that the population simply
cannot maintain itself. On the basis of this premise, and on the basis
of available mutation rates and x-ray-induced mutations at specific
loci, Muller, in 1967 (154), calculated that the mammalian genome would
contain a maximum of not much more than 30,000 genes (155). An estimate
of 30,000 gene loci for humans was also arrived at by Crow and Kimura
(156). Muller's estimate for D. melanogaster was 10,000 genes, compared
to 13,000 derived by annotation of the fly genome (26, 27). These
arguments for the theoretical maximum gene number were based on
simplified ideas of genetic load: that all genes have a certain low rate
of mutation to a deleterious state. However, it is clear that many
mouse, fly, worm, and yeast knockout mutations lead to almost no
discernible phenotypic perturbations.

The modest number of human genes means that we must look elsewhere for
the mechanisms that generate the complexities inherent in human
development and the sophisticated signaling systems that maintain
homeostasis. There are a large number of ways in which the functions of
individual genes and gene products are regulated. The degree of
"openness" of chromatin structure and hence transcriptional activity is
regulated by protein complexes that involve histone and DNA enzymatic
modifications. We enumerate many of the proteins that are likely
involved in nuclear regulation in Table 19. The location, timing, and
quantity of transcription are intimately linked to nuclear signal
transduction events as well as by the tissue-specific expression of many
of these proteins. Equally important are regulatory [...] that modulate
transcription. The spliceosomal machinery consists of multisubunit
proteins (Table 19) as well as structural and catalytic RNA elements
(159) that regulate transcript structure through alternative start and
termination sites and splicing. Hence, there is a need to study
different classes of RNA molecules (160) such as small nucleolar RNAs,
antisense riboregulator RNA, RNA involved in X-dosage compensation, and
other structural RNAs to appreciate their precise role in regulating
gene expression. The phenomenon of RNA editing in which coding changes
occur directly at the level of mRNA is of clinical and biological
relevance (161). Finally, examples of translational control include
internal ribosomal entry sites that are found in proteins involved in
cell cycle regulation and apoptosis (162). At the protein level, minor
alterations in the nature of protein-protein interactions, protein
modifications, and localization can have dramatic effects on cellular
physiology (163). This dynamic system therefore has many ways to
modulate activity, which suggests that definition of complex systems by
analysis of single genes is unlikely to be entirely successful.

In situ studies have shown that the human genome is asymmetrically
populated with G+C content, CpG islands, and genes (68). However, the
genes are not distributed quite as unequally as had been predicted
(Table 9) (69). The most G+C-rich fraction of the genome, H3 isochores,
constitute more of the genome than previously thought (about 9%), and
are the most gene-dense fraction, but contain only 25% of the genes,
rather than the predicted ~40%. The low G+C L isochores make up 65% of
the genome, and 48% of the genes. This inhomogeneity, the net result of
millions of years of mammalian gene duplication, has been described as
the "desertification" of the vertebrate genome (71). Why are there
clustered regions of high and low gene density, and are these accidents
of history or driven by selection and evolution? If these deserts are
dispensable, it ought to be possible to find mammalian genomes that are
far smaller in size than the human genome. Indeed, many species of bats
have genome sizes that are much smaller than that of humans; for
example, Miniopterus, a species of Italian bat, has a genome size that
is only 50% that of humans (164). Similarly, Muntiacus, a species of
Asian barking deer, has a genome size that is ~70% that of humans.

8.3 Human DNA sequence variation and its distribution across the genome

This is the first eukaryotic genome in which a nearly uniform
ascertainment of polymorphism [...] cataloging SNPs is complete. These
represent only a fraction of the SNPs present in the human population as
a whole. Nevertheless, this first glimpse at genome-wide variation has
revealed strong inhomogeneities in the distribution of SNPs across the
genome. Polymorphism in DNA carries with it a snapshot of the past
operation of population genetic forces, including mutation, migration,
selection, and genetic drift. The availability of a dense array of SNPs
will allow questions related to each of these factors to be addressed on
a genome-wide basis. SNP studies can establish the range of haplotypes
present in subjects of different ethnogeographic origins, providing
insights into population history and migration patterns. Although such
studies have suggested that modern human lineages derive from Africa,
many important questions regarding human origins remain unanswered, and
more analyses using detailed SNP maps will be needed to settle these
controversies. In addition to providing evidence for population
expansions, migration, and admixture, SNPs can serve as markers for the
extent of evolutionary constraint acting on particular genes. The
correlation between patterns of intraspecies and interspecies genetic
variation may prove to be especially informative to identify sites of
reduced genetic diversity that may mark loci where sequence variations
are not tolerated.

The remarkable heterogeneity in SNP density implies that there are a
variety of forces acting on polymorphism: sparse regions may have lower
SNP density because the mutation rate is lower, because most of those
regions have a lower fraction of mutations that are tolerated, or
because recent strong selection in favor of a newly arisen allele
"swept" the linked variation out of the population (165). The effect of
random genetic drift also varies widely across the genome. The
nonrecombining portion of the Y chromosome faces the strongest pressure
from random drift because there are roughly one-quarter as many Y
chromosomes in the population as there are autosomal chromosomes, and
the level of polymorphism on the Y is correspondingly less. Similarly,
the X chromosome has a smaller effective population size than the
autosomes, and its nucleotide diversity is also reduced. But even across
a single autosome, the effective population size can vary because the
density of deleterious mutations may vary. Regions of high density of
deleterious mutations will see a greater rate of elimination by
selection, and the effective population size will be smaller (166). As a
result, the density of even completely neutral SNPs will be lower
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then docks on this, and then the complex 

moves there ... (167) to the exciting area of network perturbations,
nonlinear responses and thresholds, and their pivotal role in human
diseases.

The enumeration of other "parts lists" reveals that in organisms with
complex nervous systems, neither gene number, neuron number, nor number
of cell types correlates in any meaningful manner with even simplistic
measures of structural or behavioral complexity. Nor would they be
expected to; this is the realm of nonlinearities and epigenesis (168).
The 520 million neurons of the common octopus exceed the neuronal number
in the brain of a mouse by an order of magnitude. It is apparent from a
comparison of genomic data on the mouse and human, and from comparative
mammalian neuroanatomy (169), that the morphological and behavioral
diversity found in mammals is underpinned by a similar gene repertoire
and similar neuroanatomies. For example, when one compares a pygmy
marmoset (which is only 4 inches tall and weighs about 6 ounces) to a
chimpanzee, the brain volume of this minute primate is found to be only
about 1.5 cm³, two orders of magnitude less than that of a chimp and
three orders less than that of humans. Yet the neuroanatomies of all
three brains are strikingly similar, and the behavioral characteristics
of the pygmy marmoset are little different from those of chimpanzees.
Between humans and chimpanzees, the gene number, gene structures and
functions, chromosomal and genomic organizations, and cell types and
neuroanatomies are almost indistinguishable, yet the developmental
modifications that predisposed human lineages to cortical expansion and
development of the larynx, giving rise to language, culminated in a
massive singularity that by even the simplest of criteria made humans
more complex in a behavioral sense.

8.5 Beyond single components

While few would disagree with the intuitive conclusion that Einstein's
brain was more complex than that of Drosophila, closer comparisons such
as whether the set of predicted human proteins is more complex than the
protein set of Drosophila, and if so, to what degree, are not
straightforward, since protein, protein domain, or protein-protein
interaction measures do not capture context-dependent interactions that
underpin the dynamics underlying phenotype.

Currently, there are more than 30 different mathematical descriptions of
complexity (170). However, we have yet to understand the mathematical
dependency relating the number of genes with organism complexity. One
pragmatic approach to the analysis of biological systems, which are
composed of nonidentical elements (proteins, protein complexes,
interacting cell types, and interacting neuronal populations), is
through graph theory (171). The elements of the system can be
represented by the vertices of complex topographies, with the edges
representing the interactions between them. Examination of large
networks reveals that they can self-organize, but more important, they
can be particularly robust. This robustness is not due to redundancy,
but is a property of inhomogeneously wired networks. The error tolerance
of such networks comes with a price; they are vulnerable to the
selection or removal of a few nodes that contribute disproportionately
to network stability. Gene knockouts provide an illustration. Some
knockouts may have minor effects, whereas others have catastrophic
effects on the system. In the case of vimentin, a supposedly critical
component of the cytoplasmic

Simple examination of the number of neu- 
rons, cell types, or genes or of the genome 
size does hot alone account for the differenc- 
es in complexity that we observe. Rather, it is 
the interactions within and among these sets 



• .„.t -r.. . *itiw 4 o^vivjji wiuuii ami mnong inese sets 

In th? n? l °- S ;- T 15 a Iarge Ht - erature - - that result in suoh great variation, m addition 
on the association hpfw^n .qmp -tV . T . T . . * ■ u » 



on the association between SNP /density 
^and local recombination rates in Drosoph- 
ila, and it remains an important task to 
zsscss the strengt^-of this association in the 
human genome, because of its impact on 
the design of local SNP densities for dis- 
ease-association studies. It also-remains an 
important task to validate SNPs on a 
genomic scale in order to assess the degree 
of heterogeneity among geographic and 
ethnic populations. - 

8.4 Genome complexity * 
We will soon be in a position to move away 
from the cataloging of individual compo- 
nents of the system, and beyond the sim- 
plistic notions of "this binds to that, which 



it is possible that there are "special cases" of 
regulatory gene networks that have a dispro- 
. portionate effect on the overall system. We 
have presented several examples of "regula- 
tory genes" that.are "significantly increased in 
the human genoniexompared with the fly and 
worm. These include extracellular - ligands 
and their cognate receptors (e.g., wnt; friz- * 
zled, TGF-p, ephrins, and connexins), as well 
as nuclear regulators (e.g., the KRAB and 
homeop"omain transcription factor families), 
where a few proteins control broad develop- 
mental processes. The answers to these 
"complexities" perhaps lie in these expanded 
gene families and differences in the regulato- 
ry control of ancient genes, proteins, path- 
ways, and cells. 



intermediate filament network of mammals, the 
knockout of the gene in mice reveals them to be 
reproductively normal, with no obvious pheno- 
* typic effects (172), and yet the usually conspic- 
uous vimentin network is completely absent 
On the other hand, —30% of knockouts in 
Drosophila and mice correspond to critical 
nodes whose reduction in gene product, or total 
elimination, causes the network to crash most 
of the time, although even in some of these 
cases, phenorypic normalcy ensues, given the 
appropriate genetic background. Thus, there are 
no "good" genes or "bad" genes, but only net- 
works that exist at various levels and at differ- 
ent connectivities, and at different states of 
sensitivity to perturbation. Sophisticated math- 
ematical analysis needs to be constantly evalu- 
ated against hard biological data sets mat spe- 
cifically address network dynamics. Nowhere is 
this more critical than in attempts to come to 
grips with "complexity " particularly because 
deconvoluting and correcting complex net- 
works that have undergone perturbation, and 
have resulted in human diseases, is the greatest 
significant challenge now facing us. 
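
The point about inhomogeneously wired networks (tolerant of most node losses, fragile to the loss of a few highly connected nodes) can be illustrated with a small sketch. This is not drawn from the article: the toy network, the node names, and the use of "largest connected component" as a stand-in for network integrity are all assumptions made for illustration.

```python
from collections import deque


def build_hub_network(n_hubs=3, leaves_per_hub=5):
    """Return an adjacency map for a toy, inhomogeneously wired network.

    A few hubs are linked to one another, and each hub carries several
    single-link "leaf" nodes. All node names are hypothetical.
    """
    adjacency = {}

    def link(a, b):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    hubs = [f"hub{i}" for i in range(n_hubs)]
    for i, first in enumerate(hubs):
        for second in hubs[i + 1:]:
            link(first, second)
    for hub in hubs:
        for k in range(leaves_per_hub):
            link(hub, f"{hub}_leaf{k}")
    return adjacency


def largest_component(adjacency, removed=()):
    """Size of the largest connected component once `removed` nodes are deleted."""
    seen = set(removed)
    best = 0
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over the nodes that survive the "knockout".
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


network = build_hub_network()
intact = largest_component(network)
leaf_knockout = largest_component(network, {"hub0_leaf0"})  # peripheral node removed
hub_knockout = largest_component(network, {"hub0"})         # highly connected node removed
print(intact, leaf_knockout, hub_knockout)  # prints "18 17 12"
```

In this toy case, deleting a peripheral node costs the network only that node, whereas deleting one hub disconnects a third of it. Real biological networks are of course far larger and weighted by context, so the numbers carry no biological meaning.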

It has been predicted for the last 15 years that complete sequencing of
the human genome would open up new strategies for human biological
research and would have a major impact on medicine, and through medicine
and public health, on society. Effects on biomedical research are already
being felt. This assembly of the human genome sequence is but a first,
hesitant step on a long and exciting journey toward understanding the
role of the genome in human biology. It has been possible only because of
innovations in instrumentation and software that have allowed automation
of almost every step of the process from DNA preparation to annotation.
The next steps are clear: We must define the complexity that ensues when
this relatively modest set of about 30,000 genes is expressed. The
sequence provides the framework upon which all the genetics,
biochemistry, physiology, and ultimately phenotype depend. It provides
the boundaries for scientific inquiry. The sequence is only the first
level of understanding of the genome. All genes and their control
elements must be identified; their functions, in concert as well as in
isolation, defined; their sequence variation worldwide described; and the
relation between genome variation and specific phenotypic characteristics
determined. Now we know what we have to explain.
Another paramount challenge awaits: public discussion of this information
and its potential for improvement of personal health. Many diverse
sources of data have shown that any two individuals are more than 99.9%
identical in sequence, which means that all the glorious differences
among individuals in our species that can be attributed to genes fall in
a mere 0.1% of the sequence. There are two fallacies to be avoided:
determinism, the idea that all characteristics of the person are
"hard-wired" by the genome; and reductionism, the view that with complete
knowledge of the human genome sequence, it is only a matter of time
before our understanding of gene functions and interactions will provide
a complete causal description of human variability. The real challenge of
human biology, beyond the task of finding out how genes orchestrate the
construction and maintenance of the miraculous mechanism of our bodies,
will lie ahead as we seek to explain how our minds have come to organize
thoughts sufficiently well to investigate our own existence.

identity were then examined on the basis of their
high-scoring pair (HSP) coordinates on the scaffold
to remove redundant hits, retaining hits that supported
possible alternative splicing. For BLASTX searches,
analysis was performed separately for selected model
organisms (yeast, mouse, human, C. elegans, and
D. melanogaster) so as not to exclude HSPs from these
organisms that support the same gene structure.
Sequences producing BLAST hits judged to be informative,
nonredundant, and sufficiently similar to the scaffold
sequence were then realigned to the genomic sequence
with Sim4 for ESTs, and with Lap for proteins. Because
both of these algorithms take splicing into account, the
resulting alignments usually give a better representation
of intron-exon boundaries than standard BLAST analyses
and thus facilitate further annotation (both machine and
human). In addition to the

89. Lek first compares all proteins in the proteome to
one another. Next, the resulting BLAST reports are
parsed, and a graph is created wherein each protein
constitutes a node; any hit between two proteins
with an expectation beneath a user-specified
threshold constitutes an edge. Lek then uses this
graph to compute a similarity between each protein
pair i, j [...] by simply dividing the number of BLAST
hits shared in common between the two proteins by the
total number of proteins hit by i and j. This simple
metric has several interesting properties. First,
because the metric takes into account both the
similarity [...] between the two sequences at the level
of BLAST hits, the metric respects the multidomain
nature of protein space. Two multidomain proteins, for
instance, each containing domains A and B, will have a
greater pairwise similarity to each other than either
one will have to a protein containing only A or B
domains, so long as A-B-containing multidomain proteins
are less frequent in the proteome than are single-domain
proteins containing A or B domains. A second interesting
property of this similarity metric is that it can be
used to produce a similarity matrix for the proteome as
a whole without having to first produce a multiple
alignment for each protein family, an error-prone and
very time-consuming process. Finally, the metric does
not require that either sequence have significant
homology to the other in order to have a defined
similarity to each other, only that they share at least
one significant BLAST hit in common. This is an
especially interesting property of the metric, because
it allows the rapid recovery of protein families from
the proteome for which no multiple alignment is
possible, thus providing a computational basis for the
extension of protein homology searches beyond those of
current HMM- and profile-based search methods. Once the
whole-proteome similarity matrix has been calculated,
Lek first partitions the proteome into single-linkage
clusters (27) on the basis of one or more shared BLAST
hits between two sequences. Next, these single-linkage
clusters are further partitioned into subclusters, each
member of which shares a user-specified pairwise
similarity with the other members of the cluster, as
described above. For the purposes of this publication,
we have focused on the analysis of single-linkage
clusters and what we have termed "complete clusters,"
e.g., those subclusters for which every member has a
similarity metric of 1 to every other member of the
subcluster. We believe that the single-linkage and
complete clusters are of special interest, in part,
because they allow us to estimate and to compare sizes
of core protein sets in a rigorous manner. The rationale
for this is as follows: if one imagines for a moment a
perfect clustering algorithm capable of perfectly
partitioning one or more perfectly annotated protein
sets into protein families, it is reasonable to assume
that the number of clusters will always be greater than,
or equal to, the number of single-linkage clusters,
because single-linkage clustering is a maximally
agglomerative clustering method. Thus, if there exists a
single protein in the predicted protein set containing
domains A and B, then it will be clustered by single
linkage together with all single-domain proteins
containing domains A or B. Likewise, for a predicted
protein set containing a single multidomain protein, the
number of real clusters must always be less than or
equal to the number of complete clusters, because it is
impossible to place a unique multidomain protein into a
complete cluster. Thus, the single-linkage and complete
clusters plus singletons should comprise a lower and
upper bound of sizes of core protein sets, respectively,
allowing us to compare the relative size and complexity
of different organisms' predicted protein set
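
The shared-hit metric and single-linkage step described in this note can be sketched roughly as follows. This is not the Lek program itself: the function names and the toy hit table are hypothetical, and reading "the total number of proteins hit by i and j" as the union of the two hit sets is an assumption.

```python
from itertools import combinations


def shared_hit_similarity(hits, i, j):
    """Shared BLAST hits of i and j divided by all proteins hit by either.

    `hits` maps each protein to the set of proteins it hits below the
    expectation threshold. Treating the denominator as the union of the
    two hit sets is an assumption about the description in the note.
    """
    shared = hits[i] & hits[j]
    total = hits[i] | hits[j]
    return len(shared) / len(total) if total else 0.0


def single_linkage_clusters(hits):
    """Group proteins that share one or more hits, directly or transitively."""
    parent = {protein: protein for protein in hits}

    def find(protein):
        # Union-find root lookup with path halving.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    for a, b in combinations(hits, 2):
        if hits[a] & hits[b]:
            parent[find(a)] = find(b)

    clusters = {}
    for protein in hits:
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted(clusters.values(), key=sorted)


# Hypothetical proteome: AB1 and AB2 carry domains A and B, A1 carries only A,
# B1 carries only B, and C1 is unrelated. Self-hits are included.
hits = {
    "AB1": {"AB1", "AB2", "A1", "B1"},
    "AB2": {"AB1", "AB2", "A1", "B1"},
    "A1": {"A1", "AB1", "AB2"},
    "B1": {"B1", "AB1", "AB2"},
    "C1": {"C1"},
}

multidomain_pair = shared_hit_similarity(hits, "AB1", "AB2")
multi_vs_single = shared_hit_similarity(hits, "AB1", "A1")
single_vs_single = shared_hit_similarity(hits, "A1", "B1")
clusters = single_linkage_clusters(hits)
print(multidomain_pair, multi_vs_single, single_vs_single, len(clusters))
```

In this toy table the two A-B proteins score 1.0 against each other and 0.75 against the A-only protein, while A1 and B1 still score 0.5 through their shared hits. That matches the behavior the note describes for multidomain proteins, and single linkage pulls all four into one cluster, leaving C1 as a singleton.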
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- umanity has been given a great gift. With the completion of the human.. . 
genome sequence, we have received a powerful tool for unlocking the
secrets of our genetic heritage and for finding our place among the other
participants in the adventure of life.

This week's issue of Science contains the report of the sequencing of
the human genome from a group of authors led by Craig Venter of Celera
Genomics. The report of the sequencing of the human genome from the
publicly funded consortium of laboratories led by Francis Collins appears
in this week's Nature. This stunning achievement has been portrayed,
often unfairly, as a competition between two ventures, one public and one
private. That characterization detracts from the awesome accomplishment
jointly unveiled this week. In truth, each project contributed to the
other. The inspired vision that launched the publicly funded project
roughly 10 years ago reflected, and now rewards, the confidence of those
who believe that the pursuit of large-scale fundamental problems in the
life sciences is in the national interest. The technical innovation and
drive of Craig Venter and his colleagues made it possible to celebrate
this accomplishment far sooner than was believed possible. Thus, we can
salute what has become, in the end, not a contest but a marriage (perhaps
encouraged by shotgun) between public funding and private
entrepreneurship.

There are excellent scientific reasons for applauding an outcome that
has given us two winners. Two sequences are better than one; the
opportunity for comparison and convergence is invaluable. Indeed, a
real-world proof of the importance of access to both sets of data can be
found in the pages of this issue of Science, in the comparative analysis
by Olivier et al. (p. 1298).

Although we have made the point before, it is worth repeating that the
sequencing of the human genome represents, not an ending, but the
beginning of a new approach to biology. As Galas says in his Viewpoint
(p. 1257), the knowledge that all of the genetic components of any
process can be identified will give extraordinary new power to
scientists. Because of this breakthrough, research can evolve from
analyzing the effects of individual genes to a more integrated view that
examines whole ensembles of genes as they interact to form a living human
being. Several articles in this issue highlight how this approach is
already beginning to revolutionize the way we look at human disease.

This has been a massive project, on a scale unparalleled in the history
of biology, but of course it has built on the scientific insights of
centuries of investigators. By coincidence, this landmark announcement
falls during the week of the anniversary of the birth of Charles Darwin.
Darwin's message that the survival of a species can depend on its ability
to evolve in the face of change is peculiarly pertinent to discussions
that have gone on in the past year over access to the Celera data. (Full
information regarding the agreements that were reached to make the data
available can be found at
www.sciencemag.org/feature/data/announcement/gsp.shl.) We are willing to
be flexible in allowing data repositories other than the traditional
GenBank, while insisting on access to all the data needed to verify
conclusions. In this domain, change is everywhere: Commercial researchers
are producing more and more potentially valuable sequences, yet (at least
in the United States) laws governing databases provide scant protection
against piracy. Had the Celera data been kept secret, it would have been
a serious loss to the scientific community. We hope that our adaptability
in the face of change will enable other proprietary data to be published
after peer review, in a way that satisfies our continuing commitment to
full access.

It should be no surprise that an achievement so stunning, and so
carefully watched, has created new challenges for the scientific venture.
Science is proud to have played a role in bringing this discovery to the
public stage. It is literally true that this is a historic moment for the
scientific endeavor. The human genome has been called the Book of Life.
Rather, it is a library in which, with rules that encourage exploration
and reward creativity, we can find many of the books that will help
define us and our place in the great tapestry of life.

Barbara R. Jasny and Donald Kennedy



EXHIBIT M 



Query= SEQ ID NO:1
         (3633 letters)

                                                     Score     E
Sequences producing significant alignments:          (bits)  Value

AL136167.8.1.63946                                    2221   0.0
AL161778.19.1.64150                                    559   e-156

>AL136167.8.1.63946
          Length = 63946

Score = 2221 bits (1120), Expect =0.0 
Identities = 1120/1120 (100%) 
Strand = Plus / Plus 

Query : 1 atgacatctagtaatacccaacctctgcttatgacttcc tggaacatacccacagctgaa 60 

1 1 [ 1 1 1 1 E ! 1 1 1 r I j 1 1 1 1 1 [ i i 1 1 j 1 1 1 1 1 j 1 1 E 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 f E 1 1 1 1 1 1 1 1 

Sbjct: 25701 atgacatctagtaatacccaacc tctgct tatgacttcc tggaacatacccacagctgaa 25760 
Query: 61 ggttctcagtttccaatttccaccactattaatgtacctacatccaatgagatggaaaca 12 0 

1 1 M 1 1 E i 1 1 1 1 1 1 1 1 1 1 1 1 I I I L ! 1 1 1 1 M I 1 1 1 1 9 1 I I E 1 1 1 1 1 1 ] 

Sbjct: 25761 ggttctcagtttccaatttccaccactattaatgtacctacatccaatgagatggaaaca 25820 
Query: 121 gagac tctacacct tgttcctgggcctttgtcaacattcacagcctctcagactggtcta 180 

lllllllllllllllllllllllll MM! Mlllll 1 1 II II Mill II II llllllll 

Sbjct: 25821 gagac tctacaccttgttcc tgggcctttgtcaacattcacagcctctcagac tggtcta 25880 
Query: 181 gtatctaaagatgtcatggcaatgtcatcaattcctatgtcaggaattcttcctaaccat 240 

1 1 1 1 1 1 1 1 1 E 1 1 1 1 II 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 ti 1 1 1 1 1 1 1 1 II I 

Sbjct : 25881 gtatctaaagatgtcatggcaatgtcatcaattcctatgtcaggaattc ttcctaaccat 25940 
Query: 241 gggctttctgagaacccttcattatcaacatctttaagagctatcacttccacattggct 3 00 

IIIIMIIIIIIIEIIIIIIIIIIIMIIIIIIIIIIIIIIIIIIIIIII MIMIIMI 

Sbjct : 25941 gggctt tc tgagaacccttcat tatcaacatctttaagagctatcacttccacattggct 26000 
Query: 3 01 gacgttaagcacacatttgagaaaatgaccacatctgtaactcctgggaccacactccca 3 60 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

Sbjct : 2 6001 gacgttaagcacacatttgagaaaatgaccacatctgtaactcctgggaccacactccca 26060 
Query: 3 61 tcaattctttctggtgccacttcaggatctgtaatttcaaagtcacccattctgacatgg 42 0 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 j 1 1 1 

Sbjct: 26061 tcaat tc tttctggtgccacttcaggatctgtaatttcaaagtcacccattctgacatgg 26120 
Query: 421 ctct tatctagtctcccttctggctcccctccggcaactgtatctaatgcccctcatgtt 480 

II II 1 1 M III Mill II MM II I II II llllllll I II I Mlllll 1 1 1 1 Ml II 1 1 1 

Sbjct: 26121 ctcttatctagtctcccttctggctcccctccggcaactgtatctaatgcccctcatgt t 26180 



Query* 481 atgacttcctctacagtagaggtgtcaaaatcaacatttctgacatctgacatgatatca 540 

Ml MM II I II Mil l MINIMI Mill MINIUM II Ml 1 1 II II MINIM 

Sbjct : 26181 atgacttcctctacagtagaggtgtcaaaatcaacatttctgacatctgacatgatatca 26240 
Query ■ 541 gcgcacccattcactaacttgacaacactaccctctgctactatgagcaccatactcacc 600 

II II II 1 1 1 II I II 1 1 1 1 M II 1 1 1 1 II M 1 1 1 1 1 II II 1 1 II 1 1 II I II I II II 1 1 1 II 

Sbjct : 2 6241 gcgcacccattcactaacttgacaacactaccctctgctactatgagcaccatactcacc 26300 
Query 601 cgaaccattcctacacctacactgggtggtatcactactggcttcccaacttctctccct 660 

1 1 1 1 1 II II 1 1 II II II II II I II 1 1 M I Nil I II 1 1 1 1 M II I M 1 1 1 II 1 1 II 1 1 II 

Sbjct: 26301 cgaaccattcctacacctacactgggtggtatcactactggcttcccaacttc tctccct 26360 
Query- 661 atgtctataaatgtcacagatgacattgtgtacatttccacacaccctgaggcatcctcc 72 0 

1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 II 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 1 M 

Sbjct: 26361 atgtc tataaatgtcacagatgacattgtgtacatttccacacaccctgaggcatcctcc 26420 
Query: 721 agaaccacaataactgccaaccccaggactgtgtctcatccttcatccttcagcagaaag 780 

I 1 1 E I E I E I 1 1 I E f I I 1 1 1 1 1 1 1 I I 1 1 1 1 E 1 1 1 1 1 1 1 1 1 I I I I I 1 I f i I 1 1 I I i 1 E 1 1 I ] 

Sbjct: 2 6421 agaaccacaataactgccaaccccaggactgtgtctcatccttcatccttcagcagaaag 2 6480 
Query : 781 actatgtcaccttctacaactgaccacactctatctgttggtgccatgcctctgcctagc 840 

NIMINMINMIIMMMIMMNMIMNMININMINMIIMIIII 

Sbjct : 26481 actatgtcaccttc tacaactgaccacac tctatctgttggtgccatgcctctgcctagc 26540 
Query : 841 tctacaataacatcttcatggaacagaattccaac tgcatcatcaccctc tact ttaatt 900 

I I I I I I I I I I I I 1 1 I I I I ! i I 1 I II I I 1 I I I i I 1 Ill MM 

Sbjct : 26541 tctacaataacatcttcatggaacagaattccaactgcatcatcaccc tctactttaatt 26600 
Query: 901 attcctaagcccacactggactcccttctaaatataatgactactacatccactgttcct 960 

Mill II II Mllllil Mill I I III IMIMMMM II I III I Mill M MUM . 

Sbjct: 26601 attcctaagcccacactggactcccttctaaatataatgactactacatccactgttcct 26660 


Query: 961 ggagcctcatttccactcatatccactggggtgacatatcc t tttacagcaactgtgtct 1020 

I MM II 1 1 II MM 1 1 lllllll I IIIIIIIIIIIIMI M 1 1 II I M M 1 1 1 1 1 1 1 1 1 

Sbjct: 26661 ggagcctcatttccactcatatccactggggtgacatatccttttacagcaactgtgtct 26720 
Query: 1021 tcaccaatatcgtccttttttgaaacaacttggctggactccacaccttcctttctatct 1080 

1 1 E i I ! 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 E 1 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 ) 1 1 1 1 E I E 1 1 1 1 1 1 1 1 1 

Sbjct: 26721 tcaccaatatcgtccttt tttgaaacaacttggctggactccacacct tcctt tc tatct 26780 


Query: 1081 acggaagcatcgacttcgcctactgccaccaagtccacag 112 0 

I III II Ml I MM II M Mill 1 1 IIIIIIIIIIIIMI 

Sbjct: 26781 acggaagcatcgacttcgcctactgccaccaagtccacag 26820 
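
For readers who want to tabulate the alignment summaries in this exhibit, the "Score / Expect / Identities" header lines can be read mechanically. The following is a minimal sketch, not part of the exhibit: the function names are hypothetical, and it assumes the header layout shown here, including the bare-exponent form of the expectation value (such as "e-115"), which is read as 1e-115.

```python
import re

# Patterns for the two summary lines that precede each alignment block.
SCORE_LINE = re.compile(
    r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\),\s*Expect\s*=\s*(\S+)"
)
IDENTITY_LINE = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def parse_expect(text):
    """Convert an Expect field to a float, reading 'e-115' as 1e-115."""
    return float("1" + text if text.startswith("e") else text)


def parse_hsp_header(score_line, identity_line):
    """Return bit score, raw score, expectation value, and percent identity."""
    score = SCORE_LINE.search(score_line)
    identity = IDENTITY_LINE.search(identity_line)
    if score is None or identity is None:
        raise ValueError("not an alignment summary header")
    identical, aligned = int(identity.group(1)), int(identity.group(2))
    return {
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "expect": parse_expect(score.group(3)),
        "identical": identical,
        "aligned": aligned,
        "percent_identity": 100.0 * identical / aligned,
    }


hsp = parse_hsp_header(
    "Score = 424 bits (214), Expect = e-115",
    "Identities = 217/218 (99%)",
)
print(hsp["raw_score"], hsp["expect"], round(hsp["percent_identity"], 1))
```

The percent identity is recomputed from the counts rather than taken from the rounded figure in parentheses, so 217/218 is reported as about 99.5 rather than 99.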



Score = 424 bits (214), Expect = e-115 
Identities = 217/218 (99%) 
Strand = Plus / Plus 



Query 1789 gagaatcctgaggatgttgcagagcatattttaaatttgataaatgaatccccagccctg 1848 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 ii 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 

Sbjct : 47714 gagaatgctgaggatgttgcagagcatattt taaatttgataaatgaatccccagccctg 47773 
Query: 1849 ggtaaagaagagacaaagattattgtttctaaaatatcagatatttcacaatgtgatgag 1908 

MMIIIMIIIIIIIIIIIIIIIIIIIMIIIIIIIIMIIIIIMIIII lllllllll 

Sbjct: 47774 ggtaaagaagagacaaagattattgtttctaaaatatcagatatttcacaatgtgatgag 47833 
Query: 1909 ataagtatgaacctaactcatgttatgttacaaataatcaacgttgttttggaaaagcaa 1968 

II I IMIIM lllllll IMMI I II MM lllllll llllllllll 1 1 II III MM II 

Sbjct: 47834 ataagtatgaacctaactcatgttatgttacaaataatcaacgttgttttggaaaagcaa 47893 

Query: 1969 aacaattccgcctctgatctgcatgaaataagcaatga 2006 

I I I I I I I I I I I I I I I I I 1 I I I I I I I 1 I I I I I I I I I I I I 

Sbjct: 47894 aacaattccgcctctgatctgcatgaaataagcaatga 47931 



Score = 329 bits (166), Expect = 7e-87 
Identities = 166/166 (100%) 
Strand = Plus / Plus 

Query: 1372  agttgtgtttgtcaggtcatcataaaagccagctcttccttagcatcctctgaattgatg 1431
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35677 agttgtgtttgtcaggtcatcataaaagccagctcttccttagcatcctctgaattgatg 35736

Query: 1432  agaaaaatcaaaagtaaaatacatggcaacttcacacatggaaacttcacacaagatcaa 1491
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35737 agaaaaatcaaaagtaaaatacatggcaacttcacacatggaaacttcacacaagatcaa 35796

Query: 1492  ttgacgttattagtaaactgtgaacacgttgcagtgaaaaaactag 1537
             ||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 35797 ttgacgttattagtaaactgtgaacacgttgcagtgaaaaaactag 35842



Score = 297 bits (150), Expect = 2e-77 
Identities = 159/162 (98%) 
Strand = Plus / Plus 

Query: 2 007 aattctgaggataattgagcgtcc tggtcacaagatggagttttctgggcagatagcaaa 2 06 6 

IIIIMIIIIIIIIIIIIIIII lllllllllllllllllllllllllllllllllllll 

Sbjct: 49289 aattctgaggataattgagcgtactggtcacaagatggagttttctgggcagatagcaaa 49348 



Query: 2 067 tctggcggtggccgggctggctttggctgtgctgcggggggaccacacgtttgatggcat 212 6 

1 1 I I Mill I II I M II I llilill llllillillllllll M M llllli II II MM 

Sbjct: 49349 tctgacggtggccgggctggctttggctgtgctgcggggggaccacacgt ttgatggcat 49408 



Query: 2127 ggctttcagcattcactcctatgaagaaggcccagaccctga 2168 

III Ml I MMII I MM 1 1 MM I II MM I II MM II I 

Sbjct: 49409 ggctttcagcattcactcctatgaagaaggcacagaccctga 49450 



Score = 270 bits (136), Expect = 6e-69 
Identities = 136/136 (100%) 
Strand = Plus / Plus 



Query: 1535 tagagcctggaaattgcaaagctgatgaaacagcctctaaatacaaagggacctataagt 1594 

1 1 1 1 1 1 1 M II 1 1 II II 1 1 1 II 1 1 1 1 1 1 II II 1 1 1 1 1 II 1 1 1 1 1 1 1 1 II 1 1 1 M II 1 1 1 1 

Sbjct : 37839 tagagcctggaaattgcaaagctgatgaaacagcctctaaatacaaagggacctataagt 37898 



Query: 1595 ggctattaaccaaccctacggagacagcccaaaccagatgcataaaaaatgaggatggaa 1654 

1 1 1 II 1 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M 1 1 1 1! 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 1 

Sbjct: 37899 ggctattaaccaaccctacggagacagcccaaaccagatgcataaaaaatgaggatggaa 37958 



Query: 1655 atgccacaagattc tg 1670 

llllillillllllll 

Sbjct: 37959 atgccacaagattctg 37974 



Score = 238 bits (120), Expect = 2e-59 
Identities = 120/120 (100%) 
Strand = Plus / Plus 



Query: 1670  gttcaatcagcatcaacacgggcaaatctcagtgggaaaagccaaagtttaaacaatgca 1729
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 39863 gttcaatcagcatcaacacgggcaaatctcagtgggaaaagccaaagtttaaacaatgca 39922

Query: 1730  aattgcttcaagaacttcctgacaagattgtggatcttgctaatattaccataagtgatg 1789
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 39923 aattgcttcaagaacttcctgacaagattgtggatcttgctaatattaccataagtgatg 39982



Score = 196 bits (99), Expect = 7e-47 
Identities = 99/99 (100%) 
Strand = Plus / Plus 



Query: 1117  acagtttccttctacaatgttgaaatgagcttctctgtctttgttgaagagccaaggatc 1176
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 27830 acagtttccttctacaatgttgaaatgagcttctctgtctttgttgaagagccaaggatc 27889

Query: 1177  cctattaccagtgttataaatgaatttacggaaaattcg 1215
             |||||||||||||||||||||||||||||||||||||||
Sbjct: 27890 cctattaccagtgttataaatgaatttacggaaaattcg 27928



Score = 133 bits (67), Expect = 8e-28 
Identities = 67/67 (100%) 
Strand = Plus / Plus 

Query: 1215 gttgaattctatatttcagaacagtgaattttctcttgctactctggaaacccaaattaa 1274 

IMIIMMIIMI I MM II II I llllll IIIIMM II M II Mill I llllllllll 

Sbjct: 29723 gttgaattctatatttcagaacagtgaattt tctcttgctactctggaaacccaaattaa 29782 



Query: 1275 aagcagg 1281 

lllllll 

Sbjct: 29783 aagcagg 29789 



Score = 105 bits (53), Expect = 2e-19 
Identities = 53/53 (100%) 
Strand = Plus / Plus 

Query: 13 21 ttggaacagagagaaggacaagaaatggctacaatttcctatgtaccatacag 1373 

1 1 1 1 1 1 1 1 1 1 M 1 1 1 1 1 1 1 1 1 1 1 i 1 1 1 1 1 1 1 1 1 1 1 J 1 1 1 1 1 i 1 1 i 1 1 1 1 1 1 1 1 

Sbjct : 34091 ttggaacagagagaaggacaagaaatggctacaatttcc tatgtaccatacag 34143 



Score = 85.7 bits (43), Expect = 2e-13 
Identities = 43/43 (100%) 
Strand = Plus / Plus



Query: 127 8 cagggacatttcagaggaagagatggtcatggatcgagctatt 1320 

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 M i i 1 1 1 1 1 M M 1 1 1 1 1 M 1 1 1 1 1 1 1 1 

Sbjct: 32510 cagggacatttcagaggaagagatggtcatggatcgagc tatt 32552 



>AL161778.19.1.64150

Length = 64150 

Score = 559 bits (282), Expect = e-156 
Identities = 282/282 (100%) 
Strand = Plus / Plus 



Query: 3045  ttgttggattaaagatgattctatcttttacatctcagtggtggcttatttttgcctcat 3104
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 18233 ttgttggattaaagatgattctatcttttacatctcagtggtggcttatttttgcctcat 18292

Query: 3105  atttctcatgaatctctccatgttctgcactgttcttgttcaactgaattctgtgaaatc 3164
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 18293 atttctcatgaatctctccatgttctgcactgttcttgttcaactgaattctgtgaaatc 18352

Query: 3165  ccaaatccagaagactcggcggaagatgatcctgcatgacctcaaaggcacaatgagcct 3224
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 18353 ccaaatccagaagactcggcggaagatgatcctgcatgacctcaaaggcacaatgagcct 18412

Query: 3225  gacattcttacttggcctcacctgggggtttgcattttttgcttggggacccatgaggaa 3284
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 18413 gacattcttacttggcctcacctgggggtttgcattttttgcttggggacccatgaggaa 18472

Query: 3285  ctttttcttgtatttgtttgccatttttaacactttgcaagg 3326
             ||||||||||||||||||||||||||||||||||||||||||
Sbjct: 18473 ctttttcttgtatttgtttgccatttttaacactttgcaagg 18514



Score = 535 bits (270), Expect = e-149
Identities = 270/270 (100%)
Strand = Plus / Plus

Query: 2697  caaacttcgaaaagattatcctgccaaaattctgatcaacctgtgcacagcactactgat 2756
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 12389 caaacttcgaaaagattatcctgccaaaattctgatcaacctgtgcacagcactactgat 12448

Query: 2757  gctaaacctggtatttttgatcaattcttggttgtcatcatttcagaaagtgggagtttg 2816
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 12449 gctaaacctggtatttttgatcaattcttggttgtcatcatttcagaaagtgggagtttg 12508

Query: 2817  tatcacagctgcagtggcacttcattacttcctgcttgtttcttttacttggatgggcct 2876
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 12509 tatcacagctgcagtggcacttcattacttcctgcttgtttcttttacttggatgggcct 12568

Query: 2877  ggaggcagtccacatgtatttggctctagtcaaagtcttcaacatatacattccaaatta 2936
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 12569 ggaggcagtccacatgtatttggctctagtcaaagtcttcaacatatacattccaaatta 12628

Query: 2937  tatccttaaattttgtctagttggttgggg 2966
             ||||||||||||||||||||||||||||||
Sbjct: 12629 tatccttaaattttgtctagttggttgggg 12658



Score = 335 bits (169), Expect = 1e-88
Identities = 169/169 (100%)
Strand = Plus / Plus

Query: 3427  gatgggagcagccggtgtcagataaaggttggatataaacaggagggactaaagaaaatc 3486
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 26700 gatgggagcagccggtgtcagataaaggttggatataaacaggagggactaaagaaaatc 26759

Query: 3487  tttgagcacaaactgttgacgccatctctcaagtcaactgcaactagctccactttcaaa 3546
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 26760 tttgagcacaaactgttgacgccatctctcaagtcaactgcaactagctccactttcaaa 26819

Query: 3547  tctttaggctctgcacaaggcacaccttcagaaataagctttccaaatg 3595
             |||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 26820 tctttaggctctgcacaaggcacaccttcagaaataagctttccaaatg 26868



Score = 260 bits (131), Expect = 5e-66 
Identities = 134/135 (99%) 
Strand = Plus / Plus 



Query: 2170 attttcctaggcaatgtccctgtgggagggattttggcttccatatatttgcctaaatca 2229
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 281  attttcctaggcaatgtccctgtgggagggattttggcttccatatatttgcctaaatca 340

Query: 2230 ctgacggagagaattcctcttagcaacttacaaccgatcttgtttaatttctttggccaa 2289
            ||||||||||||||||||||||||||||||||| ||||||||||||||||||||||||||
Sbjct: 341  ctgacggagagaattcctcttagcaacttacaaacgatcttgtttaatttctttggccaa 400

Query: 2290 acttcactctttaag 2304
            |||||||||||||||
Sbjct: 401  acttcactctttaag 415



Score = 236 bits (119), Expect = 8e-59 
Identities = 122/123 (99%) 
Strand = Plus / Plus 



Query: 2574  ggatttatccaggtctacagtggattcagtgaatgaacagatattagcgcttataacata 2633
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 10421 ggatttatccaggtctacagtggattcagtgaatgaacagatattagcgcttataacata 10480

Query: 2634  caccggatgtggaatctcctccattttcctgggagttgcagtggtgacatacatagcttt 2693
             ||||||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct: 10481 caccggatgtggaatctcctccatttttctgggagttgcagtggtgacatacatagcttt 10540

Query: 2694  tca 2696
             |||
Sbjct: 10541 tca 10543



Score = 212 bits (107), Expect = 1e-51
Identities = 121/125 (96%), Gaps = 3/125 (2%)
Strand = Plus / Plus

Query: 2303  agaccaaaaatgtcactaaagcattaaccacatatgttgtgagtgccagcatttc--a-g 2359
             ||||||||||||||||||||||||||||||| |||||||||||||||||||||||  | |
Sbjct: 4771  agaccaaaaatgtcactaaagcattaaccacctatgttgtgagtgccagcatttcagatg 4830

Query: 2360  atatgttcattcaaaacttagctgacccagtggttatcactctgcagcatattggaggaa 2419
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 4831  atatgttcattcaaaacttagctgacccagtggttatcactctgcagcatattggaggaa 4890

Query: 2420  accag 2424
             |||||
Sbjct: 4891  accag 4895



Score = 208 bits (105), Expect = 2e-50 
Identities = 105/105 (100%) 
Strand = Plus / Plus 

Query: 2471  atgggctgggtggatggaattcgtcaggctgtaaagtaaaggaaacaaatgtaaattaca 2530
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 7931  atgggctgggtggatggaattcgtcaggctgtaaagtaaaggaaacaaatgtaaattaca 7990

Query: 2531  caatctgtcagtgtgaccacctcacccattttggagtcttaatgg 2575
             |||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 7991  caatctgtcagtgtgaccacctcacccattttggagtcttaatgg 8035



Score = 206 bits (104), Expect = 7e-50 
Identities = 104/104 (100%) 
Strand = Plus / Plus 

Query: 3324  aggattcttcatttttgtgtttcactgtgtgatgaaggagagtgtgcgggagcagtggca 3383
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 24804 aggattcttcatttttgtgtttcactgtgtgatgaaggagagtgtgcgggagcagtggca 24863

Query: 3384  gatacacctctgctgtgggtggttgcgattggataactcttctg 3427
             ||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 24864 gatacacctctgctgtgggtggttgcgattggataactcttctg 24907



Score = 159 bits (80), Expect = 1e-35
Identities = 80/80 (100%)
Strand = Plus / Plus

Query: 2965  ggaatcccggctatcatggtggcaatcacagtcagtgtgaaaaaagatctgtatggaact 3024
             ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 15784 ggaatcccggctatcatggtggcaatcacagtcagtgtgaaaaaagatctgtatggaact 15843

Query: 3025  ctgagcccaacaactccgtt 3044
             ||||||||||||||||||||
Sbjct: 15844 ctgagcccaacaactccgtt 15863



Score = 99.6 bits (50), Expect = 1e-17
Identities = 50/50 (100%)
Strand = Plus / Plus

Query: 2421  ccagaattatggtcaagttcactgtgccttttgggattttgagaataata 2470
             ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 6072  ccagaattatggtcaagttcactgtgccttttgggattttgagaataata 6121



Score = 73.9 bits (37), Expect = 7e-10 
Identities = 38/39 (97%) 
Strand = Plus / Plus 



Query: 3595  gatgacyttgacaaagatccttactgttcctctccttga 3633
             |||||| ||||||||||||||||||||||||||||||||
Sbjct: 28994 gatgactttgacaaagatccttactgttcctctccttga 29032

LOCUS       AL161778               64150 bp    DNA     linear   PRI 06-JAN-2005
DEFINITION  Human DNA sequence from clone RP11-51D15 on chromosome Xq26.2-27.3
            Contains part of the GPR112 gene for G protein-coupled receptor
            112, complete sequence.
ACCESSION   AL161778
VERSION     AL161778.19  GI:11830771
KEYWORDS    HTG; G protein; GPR112.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 64150)
  AUTHORS   Wray,P.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-DEC-2004) Wellcome Trust Sanger Institute, Hinxton,
            Cambridgeshire, CB10 1SA, UK. E-mail enquiries: vega@sanger.ac.uk
            Clone requests: clonerequest@sanger.ac.uk
COMMENT     On Dec 14, 2000 this sequence version replaced gi:11595372.
            The following abbreviations are used to associate primary accession
            numbers given in the feature table with their source databases:
            Em:, EMBL; Sw:, SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information
            on the WORMPEP database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome X, constructed by the Sanger Centre Chromosome X Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/ChrX
            RP11-51D15 is from the library RPCI-11.1 constructed by the group
            of Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pBACe3.6
            Genome Center
            Center: Wellcome Trust Sanger Institute
            Center code: SC
            Web site: http://www.sanger.ac.uk
            Contact: vega@sanger.ac.uk
            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one subclone; and the assembly was confirmed by restriction digest,
            except on the rare occasion of the clone being a YAC.
FEATURES             Location/Qualifiers
     source          1..64150

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11830771

LOCUS       AL136167               63946 bp    DNA     linear   PRI 06-JAN-2005
DEFINITION  Human DNA sequence from clone RP1-299I16 on chromosome Xq26.1-27.3
            Contains part of the GPR112 gene for G protein-coupled receptor
            112, complete sequence.
ACCESSION   AL136167
VERSION     AL136167.8  GI:7159407
KEYWORDS    HTG; G protein; GPR112.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 63946)
  AUTHORS   Chapman,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (21-DEC-2004) Wellcome Trust Sanger Institute, Hinxton,
            Cambridgeshire, CB10 1SA, UK. E-mail enquiries: vega@sanger.ac.uk
            Clone requests: clonerequest@sanger.ac.uk
COMMENT     On Mar 6, 2000 this sequence version replaced gi:7106560.
            The following abbreviations are used to associate primary accession
            numbers given in the feature table with their source databases:
            Em:, EMBL; Sw:, SWISSPROT; Tr:, TREMBL; Wp:, WORMPEP; Information
            on the WORMPEP database can be found at
            http://www.sanger.ac.uk/Projects/C_elegans/wormpep This sequence
            was generated from part of bacterial clone contigs of human
            chromosome X, constructed by the Sanger Centre Chromosome X Mapping
            Group. Further information can be found at
            http://www.sanger.ac.uk/HGP/ChrX
            RP1-299I16 is from the library RPCI-1 constructed by the group of
            Pieter de Jong. For further details see
            http://www.chori.org/bacpac/home.htm
            VECTOR: pCYPAC2
            Genome Center
            Center: Wellcome Trust Sanger Institute
            Center code: SC
            Web site: http://www.sanger.ac.uk
            Contact: vega@sanger.ac.uk
            This sequence was finished as follows unless otherwise noted: all
            regions were either double-stranded or sequenced with an alternate
            chemistry or covered by high quality data (i.e., phred quality >=
            30); an attempt was made to resolve all sequencing problems, such
            as compressions and repeats; all regions were covered by at least
            one subclone; and the assembly was confirmed by restriction digest,
            except on the rare occasion of the clone being a YAC.
FEATURES             Location/Qualifiers
     source          1..63946


http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=7159407
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Defective images within this document are accurate representations of the original 
documents submitted by the applicant. 

Defects in the images include but are not limited to the items checked: 

☒ BLACK BORDERS

☒ IMAGE CUT OFF AT TOP, BOTTOM OR SIDES

□ FADED TEXT OR DRAWING 
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□ SKEWED/SLANTED IMAGES 

□ COLOR OR BLACK AND WHITE PHOTOGRAPHS

□ GRAY SCALE DOCUMENTS 

☒ LINES OR MARKS ON ORIGINAL DOCUMENT

□ REFERENCE(S) OR EXHIBIT(S) SUBMITTED ARE POOR QUALITY

□ OTHER: 
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As rescanning these documents will not correct the image 
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